Thread: PATCH: pgbench - option to build using ppoll() for larger connectioncounts
This version of the patch attempts to address the feedback for the previous
submission on 28-Nov-2017
This patch enables building pgbench to use ppoll() instead of select()
to allow for more than (FD_SETSIZE - 10) connections. As implemented,
when using ppoll(), the only connection limitation is system resources.
“… ppoll() is to poll() as pselect() is to select(). Since the
existing code uses select(), why not convert to poll() rather than
ppoll()?”
Time in pgbench is measured in microseconds.
The select() uses microseconds.
The ppoll() and pselect() call use nanoseconds
The poll() call uses milliseconds.
In order to not downgrade time resolution, ppoll() is used instead of poll().
Without this patch, one is limited to '(FD_SETSIZE - 10)’ number of connections.
Example of something that fails without this patch but works with the patch:
Without the patch:
$ pgbench -j 1500 -c 1500
invalid number of clients: "1500"
With the patch:
$ pgbench -j 1500 -c 1500
starting vacuum...end.
progress: 12.0 s, 782.3 tps, lat 917.507 ms stddev 846.929
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 2000
query mode: simple
number of clients: 1500
number of threads: 1500
number of transactions per client: 10
number of transactions actually processed: 15000/15000
latency average = 1180.216 ms
latency stddev = 855.126 ms
tps = 658.674816 (including connections establishing)
tps = 765.289160 (excluding connections establishing)
--
Doug Rady
Amazon Aurora, RDS PostgreSQL
radydoug@amazon.com
Attachment
Re: PATCH: pgbench - option to build using ppoll() for largerconnection counts
Hello Doug, > This version of the patch attempts to address the feedback for the > previous submission on 28-Nov-2017 Please avoid recreating a thread, but rather respond to the previous one, I missed this post. The overall function-based implementation with limited ifdefs seems readable and maintainable. I think that it rather improves the code in places by hiding details. Patch applies with one warning: pgbench11-ppoll-v6.patch:141: trailing whitespace. set_socket(sockets, PQsocket(state[i].con), i);<SPACE> warning: 1 line adds whitespace errors. The patch implies to run "autoconf" to regenerate the configure script. Compilation with "ppoll" ok, globale & local make check ok. I do not have hardware which allows to run with over 1024 clients, so I cannot say that I tested the case. Compilation without "ppoll" gets a warning: pgbench.c:5645:1: warning: ‘clear_socket’ defined but not used... clear_socket(socket_array *sa, int fd, int idx) The "clear_socket" function is not used in this case. I'd suggest to remove it from the prototypes, remove or comment the unused implementation in the select section, and keep the one in the ppoll section. Or use it if it is needed. Or inline it in the "init_socket_array" function where it is used just once. I'm not sure of the name "socket_array", because it is an array only in the ppoll case. Maybe "sockets_t" or "socket_set"? Or something else? Maybe "init_socket_array" should be named "clear_...". I would suggest not to include sys/select.h when using ppoll, as it is a useless include this case. I.e. move includes in the ifdef USE_PPOLL section? Please do not remove comments, eg: - /* identify which client sockets should be checked for input */ On #endif or #else, especially large scope ones, I would have a comment to say about what it is, eg at the minimum: #else /* select(2) implementation */ #endif /* USE_PPOLL */ On: +#if defined(USE_PPOLL) +#ifdef POLLRDHUP Use just one "ifdef" style? There should be a comment about what this sections are providing, eg: /* ppoll(2) implementation for "socket_array" functions */ There should be an empty line and possibly a comment explaining why POLLRDHUP may not be there and/or why this define is necessary. With select you always initialize the timeout, but not with ppoll. Use a common approach in the implementations? The "finishCon" function addition seems totally unrelated to this patch. Although I agree that this function improves the code, it is refactoring and does not really belong here. -- Fabien.
Re: PATCH: pgbench - option to build using ppoll() for largerconnection counts
On 1/24/18, 00:00, "Fabien COELHO" <coelho@cri.ensmp.fr> wrote: Hello Doug, Hello Fabien, > This version of the patch attempts to address the feedback for the > previous submission on 28-Nov-2017 Please avoid recreating a thread, but rather respond to the previous one, I missed this post. Got it. The overall function-based implementation with limited ifdefs seems readable and maintainable. I think that it rather improves the code in places by hiding details. Patch applies with one warning: pgbench11-ppoll-v6.patch:141: trailing whitespace. set_socket(sockets, PQsocket(state[i].con), i);<SPACE> warning: 1 line adds whitespace errors. Fixed. The patch implies to run "autoconf" to regenerate the configure script. Yes. I should have included this in the description. Compilation with "ppoll" ok, globale & local make check ok. I do not have hardware which allows to run with over 1024 clients, so I cannot say that I tested the case. Compilation without "ppoll" gets a warning: pgbench.c:5645:1: warning: ‘clear_socket’ defined but not used... clear_socket(socket_array *sa, int fd, int idx) The "clear_socket" function is not used in this case. I'd suggest to remove it from the prototypes, remove or comment the unused implementation in the select section, and keep the one in the ppoll section. Or use it if it is needed. Or inline it in the "init_socket_array" function where it is used just once. "clear_socket" removed. Functionality inlined into "init_socket_array" I'm not sure of the name "socket_array", because it is an array only in the ppoll case. Maybe "sockets_t" or "socket_set"? Or something else? Changed "socket_array" to "socket_set" Maybe "init_socket_array" should be named "clear_...". Renamed "init_socket_array" to "clear_socket_set" I would suggest not to include sys/select.h when using ppoll, as it is a useless include this case. I.e. move includes in the ifdef USE_PPOLL section? Done. Replaced USE_PPOLL with HAVE_PPOLL as having both seems redundant. Please do not remove comments, eg: - /* identify which client sockets should be checked for input */ Restored. On #endif or #else, especially large scope ones, I would have a comment to say about what it is, eg at the minimum: #else /* select(2) implementation */ #endif /* USE_PPOLL */ On: +#if defined(USE_PPOLL) +#ifdef POLLRDHUP Use just one "ifdef" style? There should be a comment about what this sections are providing, eg: /* ppoll(2) implementation for "socket_array" functions */ There should be an empty line and possibly a comment explaining why POLLRDHUP may not be there and/or why this define is necessary. Removed unneeded #ifdef around POLLRDHUP With select you always initialize the timeout, but not with ppoll. Use a common approach in the implementations? Fixed init & passing of timeout. The "finishCon" function addition seems totally unrelated to this patch. Although I agree that this function improves the code, it is refactoring and does not really belong here. Removed. -- Fabien. Thank you!! doug
Attachment
Re: PATCH: pgbench - option to build using ppoll() for largerconnection counts
ISTM that the v7 patch version you sent is identical to v6. -- Fabien.
Re: PATCH: pgbench - option to build using ppoll() for largerconnection counts
This time with the revised patch file: pgbench11-ppoll-v8.patch On 1/24/18, 00:00, "Fabien COELHO" <coelho@cri.ensmp.fr> wrote: Hello Doug, Hello Fabien, > This version of the patch attempts to address the feedback for the > previous submission on 28-Nov-2017 Please avoid recreating a thread, but rather respond to the previous one, I missed this post. Got it. The overall function-based implementation with limited ifdefs seems readable and maintainable. I think that it rather improves the code in places by hiding details. Patch applies with one warning: pgbench11-ppoll-v6.patch:141: trailing whitespace. set_socket(sockets, PQsocket(state[i].con), i);<SPACE> warning: 1 line adds whitespace errors. Fixed. The patch implies to run "autoconf" to regenerate the configure script. Yes. I should have included this in the description. Compilation with "ppoll" ok, globale & local make check ok. I do not have hardware which allows to run with over 1024 clients, so I cannot say that I tested the case. Compilation without "ppoll" gets a warning: pgbench.c:5645:1: warning: ‘clear_socket’ defined but not used... clear_socket(socket_array *sa, int fd, int idx) The "clear_socket" function is not used in this case. I'd suggest to remove it from the prototypes, remove or comment the unused implementation in the select section, and keep the one in the ppoll section. Or use it if it is needed. Or inline it in the "init_socket_array" function where it is used just once. "clear_socket" removed. Functionality inlined into "init_socket_array" I'm not sure of the name "socket_array", because it is an array only in the ppoll case. Maybe "sockets_t" or "socket_set"? Or something else? Changed "socket_array" to "socket_set" Maybe "init_socket_array" should be named "clear_...". Renamed "init_socket_array" to "clear_socket_set" I would suggest not to include sys/select.h when using ppoll, as it is a useless include this case. I.e. move includes in the ifdef USE_PPOLL section? Done. Replaced USE_PPOLL with HAVE_PPOLL as having both seems redundant. Please do not remove comments, eg: - /* identify which client sockets should be checked for input */ Restored. On #endif or #else, especially large scope ones, I would have a comment to say about what it is, eg at the minimum: #else /* select(2) implementation */ #endif /* USE_PPOLL */ On: +#if defined(USE_PPOLL) +#ifdef POLLRDHUP Use just one "ifdef" style? There should be a comment about what this sections are providing, eg: /* ppoll(2) implementation for "socket_array" functions */ There should be an empty line and possibly a comment explaining why POLLRDHUP may not be there and/or why this define is necessary. Removed unneeded #ifdef around POLLRDHUP With select you always initialize the timeout, but not with ppoll. Use a common approach in the implementations? Fixed init & passing of timeout. The "finishCon" function addition seems totally unrelated to this patch. Although I agree that this function improves the code, it is refactoring and does not really belong here. Removed. -- Fabien. Thank you!! doug
Attachment
Re: PATCH: pgbench - option to build using ppoll() for largerconnection counts
Hello Doug, > This time with the revised patch file: pgbench11-ppoll-v8.patch Patch applies cleanly. Compiles cleanly and runs fine in both ppoll & select cases. I'm okay with having a preferred ppoll implementation because of its improved capability. A few minor additional comments/suggestions: Cpp has an #elif that could be used to manage the ppoll/select alternative. It is already used elsewhere in the file. Or not. I must admit that I'm not fond of the alloc_socket_set trick with MAXCLIENTS, especially without any comment. I'd suggest to just have two distinct functions in their corresponding sections. I would add a comment that free_socket_set code is common to both versions, and maybe consider moving it afterwards. Or maybe just duplicate if in each section for homogeneity. It looks like error_on_socket and ignore_socket should return a boolean instead of an int. Also, maybe simplify the implementation of the later by avoiding the ?: expression. ISTM that the error_on_socket function in the ppoll case deserves some comments, especially on the condition. > [...] Replaced USE_PPOLL with HAVE_PPOLL as having both seems redundant. I'm okay with that. I'm wondering whether there should be a way to force using one or the other when both are available. Not sure. -- Fabien.
Re: PATCH: pgbench - option to build using ppoll() for largerconnection counts
On 1/25/18, 14:46, "Fabien COELHO" <coelho@cri.ensmp.fr> wrote: Hello Doug, Hello Fabien, > This time with the revised patch file: pgbench11-ppoll-v8.patch Patch applies cleanly. Compiles cleanly and runs fine in both ppoll & select cases. I'm okay with having a preferred ppoll implementation because of its improved capability. A few minor additional comments/suggestions: Cpp has an #elif that could be used to manage the ppoll/select alternative. It is already used elsewhere in the file. Or not. I must admit that I'm not fond of the alloc_socket_set trick with MAXCLIENTS, especially without any comment. I'd suggest to just have two distinct functions in their corresponding sections. Made a distinct function for each section. I would add a comment that free_socket_set code is common to both versions, and maybe consider moving it afterwards. Or maybe just duplicate if in each section for homogeneity. Duplicated. It looks like error_on_socket and ignore_socket should return a boolean instead of an int. Also, maybe simplify the implementation of the later by avoiding the ?: expression. ISTM that the error_on_socket function in the ppoll case deserves some comments, especially on the condition. Added comment. Changed output to be more compatible with socket() error check. Extra ppoll() specific error info only output when debug flag set ... not sure if this is OK or should just remove the extrainfo. > [...] Replaced USE_PPOLL with HAVE_PPOLL as having both seems redundant. I'm okay with that. I'm wondering whether there should be a way to force using one or the other when both are available. Not sure. Added option to force use of select(2) via: -DUSE_SELECT -- Fabien. Thank you!! doug
Attachment
Re: PATCH: pgbench - option to build using ppoll() for largerconnection counts
Hello Doug, Patch applies, compiles, tests ok. > > [...] Replaced USE_PPOLL with HAVE_PPOLL as having both seems redundant. > > I'm okay with that. I'm wondering whether there should be a way to force > using one or the other when both are available. Not sure. > > Added option to force use of select(2) via: -DUSE_SELECT USE_SELECT could mean something somewhere. Maybe use something more specific like PGBENCH_USE_SELECT? Having this macro available simplifies testing. I'm not sure why you do the following trick, could you explain? +#undef USE_SELECT +#define USE_SELECT In the select implementation you do: return (socket_set *) pg_malloc0(sizeof(socket_set) * MAXCLIENTS); but ISTM that socket_set is already an fd_set which represents a set of clients, so allocating a number of it is needed. The initial implementation just does "fs_set input_mask", whetever the number of clients, and it works fine. -- Fabien.
Re: PATCH: pgbench - option to build using ppoll() for largerconnection counts
On 1/26/18, 15:00, "Fabien COELHO" <coelho@cri.ensmp.fr> wrote: Hello Doug, Hello Fabien, Patch applies, compiles, tests ok. > > [...] Replaced USE_PPOLL with HAVE_PPOLL as having both seems redundant. > > I'm okay with that. I'm wondering whether there should be a way to force > using one or the other when both are available. Not sure. > > Added option to force use of select(2) via: -DUSE_SELECT USE_SELECT could mean something somewhere. Maybe use something more specific like PGBENCH_USE_SELECT? Having this macro available simplifies testing. Changed to PGBENCH_USE_SELECT I'm not sure why you do the following trick, could you explain? +#undef USE_SELECT +#define USE_SELECT This was due to compiler complaint about USE_SELECT being redefined. Have replaced that "trick" with a new #define POLL_USING_SELECT which is used elsewhere in pgbench instead. In the select implementation you do: return (socket_set *) pg_malloc0(sizeof(socket_set) * MAXCLIENTS); but ISTM that socket_set is already an fd_set which represents a set of clients, so allocating a number of it is needed. The initial implementation just does "fs_set input_mask", whetever the number of clients, and it works fine. Ugh. Yes, for socket() only one (1) fd_set is needed. Fixed. -- Fabien. Thank you, again!! doug
Attachment
Re: PATCH: pgbench - option to build using ppoll() for largerconnection counts
Hello Doug, > I'm not sure why you do the following trick, could you explain? > +#undef USE_SELECT > +#define USE_SELECT > > This was due to compiler complaint about USE_SELECT being redefined. > Have replaced that "trick" with a new #define POLL_USING_SELECT which is used elsewhere in pgbench instead. Ok, why not. Another option to avoid the warning and a new name could have been to "#ifndef X #define X #endif /* !X */" Patch applies cleanly, compiles cleanly for both options, local & global "make check" ok. Switched to "Ready". -- Fabien.
Re: PATCH: pgbench - option to build using ppoll() for largerconnection counts
On 2018-01-28 23:02:57 +0000, Rady, Doug wrote: > diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c > index 31ea6ca06e..689f15a772 100644 > --- a/src/bin/pgbench/pgbench.c > +++ b/src/bin/pgbench/pgbench.c > @@ -44,7 +44,13 @@ > #include <signal.h> > #include <time.h> > #include <sys/time.h> > -#ifdef HAVE_SYS_SELECT_H > +#ifdef PGBENCH_USE_SELECT /* force use of select(2)? */ > +#undef HAVE_PPOLL > +#endif > +#ifdef HAVE_PPOLL > +#include <poll.h> > +#elif defined(HAVE_SYS_SELECT_H) > +#define POLL_USING_SELECT (random thing noticed while going through patches) It strikes me as a bad idea to undefine configure selected symbols. Postgres header might rely on them. It also strikes me as entirely unnecessary here. - Andres
Re: PATCH: pgbench - option to build using ppoll() for largerconnection counts
>> -#ifdef HAVE_SYS_SELECT_H >> +#ifdef PGBENCH_USE_SELECT /* force use of select(2)? */ >> +#undef HAVE_PPOLL >> +#endif >> +#ifdef HAVE_PPOLL >> +#include <poll.h> >> +#elif defined(HAVE_SYS_SELECT_H) >> +#define POLL_USING_SELECT > > (random thing noticed while going through patches) > > It strikes me as a bad idea to undefine configure selected > symbols. Postgres header might rely on them. It also strikes me as > entirely unnecessary here. Yes, I though about this one but let it pass. Indeed, it would be sufficient to not load "poll.h" when select is forced, without undefining the configure setting. -- Fabien.
Re: PATCH: pgbench - option to build using ppoll() for largerconnection counts
On 2018-03-01 11:30:39 +0100, Fabien COELHO wrote: > > > > -#ifdef HAVE_SYS_SELECT_H > > > +#ifdef PGBENCH_USE_SELECT /* force use of select(2)? */ > > > +#undef HAVE_PPOLL > > > +#endif > > > +#ifdef HAVE_PPOLL > > > +#include <poll.h> > > > +#elif defined(HAVE_SYS_SELECT_H) > > > +#define POLL_USING_SELECT > > > > (random thing noticed while going through patches) > > > > It strikes me as a bad idea to undefine configure selected > > symbols. Postgres header might rely on them. It also strikes me as > > entirely unnecessary here. > > Yes, I though about this one but let it pass. Indeed, it would be sufficient > to not load "poll.h" when select is forced, without undefining the configure > setting. I've marked the CF entry waiting on author. Greetings, Andres Freund
Re: PATCH: pgbench - option to build using ppoll() for largerconnection counts
Updated the patch to not do the #undef pgbench11-ppoll-v11.patch attached. Thanks, doug On 3/3/18, 16:14, "Andres Freund" <andres@anarazel.de> wrote: On 2018-03-01 11:30:39 +0100, Fabien COELHO wrote: > > > > -#ifdef HAVE_SYS_SELECT_H > > > +#ifdef PGBENCH_USE_SELECT /* force use of select(2)? */ > > > +#undef HAVE_PPOLL > > > +#endif > > > +#ifdef HAVE_PPOLL > > > +#include <poll.h> > > > +#elif defined(HAVE_SYS_SELECT_H) > > > +#define POLL_USING_SELECT > > > > (random thing noticed while going through patches) > > > > It strikes me as a bad idea to undefine configure selected > > symbols. Postgres header might rely on them. It also strikes me as > > entirely unnecessary here. > > Yes, I though about this one but let it pass. Indeed, it would be sufficient > to not load "poll.h" when select is forced, without undefining the configure > setting. I've marked the CF entry waiting on author. Greetings, Andres Freund
Attachment
Re: PATCH: pgbench - option to build using ppoll() for largerconnection counts
Hello Doug, > Updated the patch to not do the #undef > pgbench11-ppoll-v11.patch attached. Patch applies. Do not forget to regenerate configure to test... I've compiled and run with both ppoll & select options without issues. Two quite minor style comment (at least 2 instances): if (cond) return false; else return true; ISTM that the simpler the better: return !cond; Also ISTM that the following does not comply with pg C style expectations (idem, 2 instances): } else { -- Fabien.
Re: PATCH: pgbench - option to build using ppoll() for largerconnection counts
On 3/25/18, 04:00, "Fabien COELHO" <coelho@cri.ensmp.fr> wrote: Hello Fabien, Hello Doug, > Updated the patch to not do the #undef > pgbench11-ppoll-v11.patch attached. Patch applies. Do not forget to regenerate configure to test... I've compiled and run with both ppoll & select options without issues. Two quite minor style comment (at least 2 instances): if (cond) return false; else return true; ISTM that the simpler the better: return !cond; Fixed. Also ISTM that the following does not comply with pg C style expectations (idem, 2 instances): } else { Fixed. Also fixed issue with 'timeout' not being passed as NULL when no timeout time. -- Fabien. Thanks! doug
Attachment
Re: PATCH: pgbench - option to build using ppoll() for largerconnection counts
Hello Doug, >> I've compiled and run with both ppoll & select options without issues. >> Two quite minor style comment (at least 2 instances): > Fixed. Fixed. Also fixed issue with 'timeout' not being passed as NULL > when no timeout time. v12 compiled and tested with both ppoll & select (by forcing). All seems ok to me. Switched back to "ready". -- Fabien.
Re: PATCH: pgbench - option to build using ppoll() for largerconnection counts
Hi, I'm still not particularly happy with this. Checking whether I can polish it up. a) the new function names are partially non-descriptive and their meaning is undocumented. As an extreme example: - if (!FD_ISSET(sock, &input_mask)) + if (ignore_socket(sockets, i, st->con)) continue; reading the new code it's entirely unclear what that could mean. Are you marking the socket as ignored? What does ignored even mean? There's not a single comment on what the new functions mean. It's not that bad if there's no docs on what FD_ISSET implies, because that's a well known and documented interface. But introducing an abstraction without any comments on it? b) Does this actually improve the situation all that much? We still loop over every socket: /* ok, advance the state machine of each connection */ for (i = 0; i < nstate; i++) { CState *st = &state[i]; if (st->state == CSTATE_WAIT_RESULT) { /* don't call doCustom unless data is available */ if (error_on_socket(sockets, i, st->con)) goto done; if (ignore_socket(sockets, i, st->con)) continue; } else if (st->state == CSTATE_FINISHED || st->state == CSTATE_ABORTED) { /* this client is done, no need to consider it anymore */ continue; } doCustom(thread, st, &aggs); /* If doCustom changed client to finished state, reduce remains */ if (st->state == CSTATE_FINISHED || st->state == CSTATE_ABORTED) remains--; } if the goal is to make this more scalable, wouldn't this require using a proper polling mechanism that supports signalling the sockets with relevant changes, rather than busy-looping through every socket if there's just a single change? I kinda wonder if the proper fix wouldn't be to have one patch making WaitEventSets usable from frontend code, and then make this code use them. Not a small project though. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > I'm still not particularly happy with this. I'm a bit confused as to what the point is. It seems unlikely that one pgbench process can effectively drive enough backends for select's limitations to really be an issue. regards, tom lane
Re: PATCH: pgbench - option to build using ppoll() for largerconnection counts
On 2018-04-06 17:49:19 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > I'm still not particularly happy with this. > > I'm a bit confused as to what the point is. It seems unlikely that one > pgbench process can effectively drive enough backends for select's > limitations to really be an issue. It's not that crazy to use more than 1024 connections (common FD_SETSIZE valu), especially over a higher latency connection. As I wrote, I think using a poll API that doesn't require looping over all sockets, even if they didn't get an event, would be a better plan. - Andres
Re: PATCH: pgbench - option to build using ppoll() for larger connection counts
On Apr 7, 2018, at 12:49 AM, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: >> I'm still not particularly happy with this. > > I'm a bit confused as to what the point is. It seems unlikely that one > pgbench process can effectively drive enough backends for select's > limitations to really be an issue. pgbench is multithreaded application, so in theory it can drive almost arbitrary number of connections. It is limited only by network throughput, but if pgbench is launched at the same host and connected to the server throughloopback or unix sockets, then network is also not a limit. We quite often have to spawn more than 1k connections and SMP systems with hundreds of CPU. So there are two choices: either use patched version of pgbench which is using poll, either spawn several instances of pgbench(which is not always convenient). > > regards, tom lane >
Re: PATCH: pgbench - option to build using ppoll() for largerconnection counts
On Tue, Mar 27, 2018 at 9:23 AM, Rady, Doug <radydoug@amazon.com> wrote: > pgbench11-ppoll-v12.patch Hi Doug, FYI this patch is trying and failing to use ppoll() on Windows: https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.30 -- Thomas Munro http://www.enterprisedb.com
Re: PATCH: pgbench - option to build using ppoll() for largerconnection counts
On 05/17/2018 01:23 AM, Thomas Munro wrote: > On Tue, Mar 27, 2018 at 9:23 AM, Rady, Doug <radydoug@amazon.com> wrote: >> pgbench11-ppoll-v12.patch > Hi Doug, > > FYI this patch is trying and failing to use ppoll() on Windows: > > https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.30 > It's still failing - see <https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.4098> I'm setting this back to "Waiting on Author" until that's fixed. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: pgbench - option to build using ppoll() for largerconnection counts
On 07/03/2018 07:52 PM, Andrew Dunstan wrote: > > > On 05/17/2018 01:23 AM, Thomas Munro wrote: >> On Tue, Mar 27, 2018 at 9:23 AM, Rady, Doug <radydoug@amazon.com> wrote: >>> pgbench11-ppoll-v12.patch >> Hi Doug, >> >> FYI this patch is trying and failing to use ppoll() on Windows: >> >> https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.30 >> > > > It's still failing - see > <https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.4098> > > I'm setting this back to "Waiting on Author" until that's fixed. > The author hasn't replied, but the attached seems to have cured the bitrot so that it at least applies. Let's see what the cfbot makes of it and then possibly fix any Windows issues. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Re: PATCH: pgbench - option to build using ppoll() for largerconnection counts
On 08/09/2018 12:45 PM, Andrew Dunstan wrote: > > > On 07/03/2018 07:52 PM, Andrew Dunstan wrote: >> >> >> On 05/17/2018 01:23 AM, Thomas Munro wrote: >>> On Tue, Mar 27, 2018 at 9:23 AM, Rady, Doug <radydoug@amazon.com> >>> wrote: >>>> pgbench11-ppoll-v12.patch >>> Hi Doug, >>> >>> FYI this patch is trying and failing to use ppoll() on Windows: >>> >>> https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.30 >>> >>> >> >> >> It's still failing - see >> <https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.4098> >> >> I'm setting this back to "Waiting on Author" until that's fixed. >> > > > The author hasn't replied, but the attached seems to have cured the > bitrot so that it at least applies. Let's see what the cfbot makes of > it and then possibly fix any Windows issues. > > > And there's still a Windows problem which I think is cured in the attached patch cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Re: PATCH: pgbench - option to build using ppoll() for largerconnection counts
On 08/09/2018 05:46 PM, Andrew Dunstan wrote: > > > On 08/09/2018 12:45 PM, Andrew Dunstan wrote: >> >> >> On 07/03/2018 07:52 PM, Andrew Dunstan wrote: >>> >>> >>> On 05/17/2018 01:23 AM, Thomas Munro wrote: >>>> On Tue, Mar 27, 2018 at 9:23 AM, Rady, Doug <radydoug@amazon.com> >>>> wrote: >>>>> pgbench11-ppoll-v12.patch >>>> Hi Doug, >>>> >>>> FYI this patch is trying and failing to use ppoll() on Windows: >>>> >>>> https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.30 >>>> >>>> >>> >>> >>> It's still failing - see >>> <https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.4098> >>> >>> I'm setting this back to "Waiting on Author" until that's fixed. >>> >> >> >> The author hasn't replied, but the attached seems to have cured the >> bitrot so that it at least applies. Let's see what the cfbot makes of >> it and then possibly fix any Windows issues. >> >> >> > > > And there's still a Windows problem which I think is cured in the > attached patch > and the CFBot agrees. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: pgbench - option to build using ppoll() for largerconnection counts
>> The author hasn't replied, but the attached seems to have cured the >> bitrot so that it at least applies. Let's see what the cfbot makes of >> it and then possibly fix any Windows issues. The patch was not applying cleanly anymore for me, so here is a rebase of your latest version. Morever, ISTM that Tom's "why?" question has been answered: there are very large systems out there with many processors, which are tested against many connections, exceeding select limit. I have turned back this patch to ready. -- Fabien.
Attachment
"Rady, Doug" <radydoug@amazon.com> writes: > This patch enables building pgbench to use ppoll() instead of select() > to allow for more than (FD_SETSIZE - 10) connections. As implemented, > when using ppoll(), the only connection limitation is system resources. So ... why exactly is this patch insisting on ppoll() rather than just plain poll()? AFAICS, all you're doing with that is being able to specify the timeout in microsec not millisec, which does not really justify taking much of a hit in portability, to my mind. > “… ppoll() is to poll() as pselect() is to select(). Since the > existing code uses select(), why not convert to poll() rather than > ppoll()?” A moment's glance at our git history will remind you that we attempted to start using pselect() last year, and the attempt crashed and burned: Author: Tom Lane <tgl@sss.pgh.pa.us> Branch: master Release: REL_10_BR [64925603c] 2017-04-24 18:29:03 -0400 Revert "Use pselect(2) not select(2), if available, to wait in postmaster's loop." This reverts commit 81069a9efc5a374dd39874a161f456f0fb3afba4. Buildfarm results suggest that some platforms have versions of pselect(2) that are not merely non-atomic, but flat out non-functional. Revert the use-pselect patch to confirm this diagnosis (and exclude the no-SA_RESTART patch as the source of trouble). If it's so, we should probably look into blacklisting specific platforms that have broken pselect. Discussion: https://postgr.es/m/9696.1493072081@sss.pgh.pa.us Now, it might be that ppoll doesn't suffer from as many portability problems as pselect, but I don't see a good reason to assume that. So I'd rather see if we can't convert this to use poll(), and thereby ensure that it works basically everywhere. regards, tom lane
I wrote: > So ... why exactly is this patch insisting on ppoll() rather than just > plain poll()? AFAICS, all you're doing with that is being able to > specify the timeout in microsec not millisec, which does not really > justify taking much of a hit in portability, to my mind. To check into my assumptions here, I did a bit of testing to find out what ppoll can really do in terms of timeout accuracy. As best I can tell, its resolution is on the order of 100 usec on both fairly new Linux kernels (4.17.19 on Fedora 28) and fairly old ones (2.6.32 on RHEL6). So there's not really that much daylight between what you can do with ppoll() and with poll()'s millisecond resolution. select(2) seems to have about the same effective timeout resolution. Interestingly, select(2)'s timeout resolution is noticeably better than that, around 10 usec, on recent macOS. Didn't try other platforms. Also, I'd misunderstood the portability angle. Unlike pselect(), ppoll() is *not* in POSIX. It started as a Linux-ism, although I see it's appeared recently in FreeBSD. That somewhat assuages my fear of broken implementations on specific platforms, but it also means that it's going to be far from universally available. So the question here really boils down to whether you think that a large set of pgbench connections is interesting on non-Linux platforms. There's a case to be made that it isn't, perhaps, but I'm not exactly sold. On the other hand, while we could certainly make a poll-based code path within this patch, I'm not quite sure what we'd do with it. My results for macOS indicate that using poll rather than select would create a tradeoff: in return for possibly allowing more clients, there would be a definite loss in timeout precision. I don't think that limiting \sleep commands to ms precision would be so awful, but it's easier to believe that loss of precision in the transaction dispatch rate for "--rate" tests could be a problem for some people. So we might have to expose the choice of which call to use to users, which seems like a mess. So maybe what we've got here --- make it better on Linux, with no change elsewhere --- is about as good as we can hope for. Also, I notice that the kevent syscall available on macOS and some BSDen uses a struct-timespec timeout parameter, ie, theoretical nsec resolution same as ppoll. So there's a case to be made that where we should go for those platforms is to build a kevent-based code path not a poll-based one. It'd be unreasonable to insist on this patch providing that, though. Anyway, bottom line: my objection on Wednesday was mainly based on the assumption that we might have to contend with broken ppoll on some platforms, which now appears to be fallacious. So, objection withdrawn. regards, tom lane
Fabien COELHO <coelho@cri.ensmp.fr> writes: > The patch was not applying cleanly anymore for me, so here is a rebase of > your latest version. The cfbot doesn't like that patch, probably because of the Windows newlines. Here's a version with regular newlines, and some cosmetic cleanup in the configure infrastructure. I haven't looked at the pgbench changes proper yet, but I did quickly test building on FreeBSD 11, which has ppoll, and it falls over: pgbench.c:6080:69: error: use of undeclared identifier 'POLLRDHUP' ...== -1 || (PQsocket(con) >= 0 && !(sa[idx].revents & POLL_UNWANTED))) ^ pgbench.c:6059:24: note: expanded from macro 'POLL_UNWANTED' #define POLL_UNWANTED (POLLRDHUP|POLLERR|POLLHUP|POLLNVAL) ^ pgbench.c:6085:42: error: use of undeclared identifier 'POLLRDHUP' errno, sa[idx].fd, (sa[idx].revents & POLL_UNWANTED)); ^ pgbench.c:6059:24: note: expanded from macro 'POLL_UNWANTED' #define POLL_UNWANTED (POLLRDHUP|POLLERR|POLLHUP|POLLNVAL) ^ pgbench.c:6107:19: error: use of undeclared identifier 'POLLRDHUP' sa[idx].events = POLL_EVENTS; ^ pgbench.c:6057:22: note: expanded from macro 'POLL_EVENTS' #define POLL_EVENTS (POLLRDHUP|POLLIN|POLLPRI) ^ 3 errors generated. make[3]: *** [<builtin>: pgbench.o] Error 1 I'm strongly tempted to just remove the POLL_UNWANTED business altogether, as it seems both pointless and unportable on its face. Almost by definition, we can't know what "other" bits a given implementation might set. I'm not entirely following the point of including POLLRDHUP in POLL_EVENTS, either. What's wrong with the traditional solution of detecting EOF? regards, tom lane diff --git a/configure b/configure index 9b30402..21ecd29 100755 *** a/configure --- b/configure *************** fi *** 15093,15099 **** LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! for ac_func in cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstatpthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimeswcstombs_l do : as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var" --- 15093,15099 ---- LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! for ac_func in cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppollpstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimeswcstombs_l do : as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var" diff --git a/configure.in b/configure.in index 2e60a89..8fe6894 100644 *** a/configure.in --- b/configure.in *************** PGAC_FUNC_WCSTOMBS_L *** 1562,1568 **** LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! AC_CHECK_FUNCS([cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocatepstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_rangeutime utimes wcstombs_l]) AC_REPLACE_FUNCS(fseeko) case $host_os in --- 1562,1568 ---- LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! AC_CHECK_FUNCS([cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocateppoll pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_rangeutime utimes wcstombs_l]) AC_REPLACE_FUNCS(fseeko) case $host_os in diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c index 41b756c..3d378db 100644 *** a/src/bin/pgbench/pgbench.c --- b/src/bin/pgbench/pgbench.c *************** *** 45,53 **** --- 45,62 ---- #include <signal.h> #include <time.h> #include <sys/time.h> + #ifndef PGBENCH_USE_SELECT /* force use of select(2)? */ + #ifdef HAVE_PPOLL + #define POLL_USING_PPOLL + #include <poll.h> + #endif + #endif + #ifndef POLL_USING_PPOLL + #define POLL_USING_SELECT #ifdef HAVE_SYS_SELECT_H #include <sys/select.h> #endif + #endif #ifdef HAVE_SYS_RESOURCE_H #include <sys/resource.h> /* for getrlimit */ *************** static int pthread_join(pthread_t th, vo *** 92,104 **** /******************************************************************** * some configurable parameters */ ! ! /* max number of clients allowed */ #ifdef FD_SETSIZE ! #define MAXCLIENTS (FD_SETSIZE - 10) #else ! #define MAXCLIENTS 1024 #endif #define DEFAULT_INIT_STEPS "dtgvp" /* default -I setting */ --- 101,119 ---- /******************************************************************** * some configurable parameters */ ! #ifdef POLL_USING_SELECT /* using select(2) */ ! #define SOCKET_WAIT_METHOD "select" ! typedef fd_set socket_set; #ifdef FD_SETSIZE ! #define MAXCLIENTS (FD_SETSIZE - 10) /* system limited max number of clients allowed */ #else ! #define MAXCLIENTS 1024 /* max number of clients allowed */ #endif + #else /* using ppoll(2) */ + #define SOCKET_WAIT_METHOD "ppoll" + typedef struct pollfd socket_set; + #define MAXCLIENTS -1 /* unlimited number of clients */ + #endif /* POLL_USING_SELECT */ #define DEFAULT_INIT_STEPS "dtgvp" /* default -I setting */ *************** static void addScript(ParsedScript scrip *** 525,530 **** --- 540,552 ---- static void *threadRun(void *arg); static void setalarm(int seconds); static void finishCon(CState *st); + static socket_set *alloc_socket_set(int count); + static bool error_on_socket(socket_set *sa, int idx, PGconn *con); + static void free_socket_set(socket_set *sa); + static bool ignore_socket(socket_set *sa, int idx, PGconn *con); + static void clear_socket_set(socket_set *sa, int count); + static void set_socket(socket_set *sa, int fd, int idx); + static int wait_on_socket_set(socket_set *sa, int nstate, int maxsock, int64 usec); /* callback functions for our flex lexer */ *************** doConnect(void) *** 1143,1148 **** --- 1165,1171 ---- !have_password) { PQfinish(conn); + conn = NULL; simple_prompt("Password: ", password, sizeof(password), false); have_password = true; new_pass = true; *************** main(int argc, char **argv) *** 4903,4909 **** case 'c': benchmarking_option_set = true; nclients = atoi(optarg); ! if (nclients <= 0 || nclients > MAXCLIENTS) { fprintf(stderr, "invalid number of clients: \"%s\"\n", optarg); --- 4926,4932 ---- case 'c': benchmarking_option_set = true; nclients = atoi(optarg); ! if (nclients <= 0 || (MAXCLIENTS != -1 && nclients > MAXCLIENTS)) { fprintf(stderr, "invalid number of clients: \"%s\"\n", optarg); *************** threadRun(void *arg) *** 5614,5619 **** --- 5637,5643 ---- int64 next_report = last_report + (int64) progress * 1000000; StatsData last, aggs; + socket_set *sockets = alloc_socket_set(nstate); /* * Initialize throttling rate target for all of the thread's clients. It *************** threadRun(void *arg) *** 5657,5662 **** --- 5681,5687 ---- { if ((state[i].con = doConnect()) == NULL) goto done; + set_socket(sockets, PQsocket(state[i].con), i); } } *************** threadRun(void *arg) *** 5673,5685 **** /* loop till all clients have terminated */ while (remains > 0) { - fd_set input_mask; int maxsock; /* max socket number to be waited for */ int64 min_usec; int64 now_usec = 0; /* set this only if needed */ /* identify which client sockets should be checked for input */ ! FD_ZERO(&input_mask); maxsock = -1; min_usec = PG_INT64_MAX; for (i = 0; i < nstate; i++) --- 5698,5709 ---- /* loop till all clients have terminated */ while (remains > 0) { int maxsock; /* max socket number to be waited for */ int64 min_usec; int64 now_usec = 0; /* set this only if needed */ /* identify which client sockets should be checked for input */ ! clear_socket_set(sockets, nstate); maxsock = -1; min_usec = PG_INT64_MAX; for (i = 0; i < nstate; i++) *************** threadRun(void *arg) *** 5728,5734 **** goto done; } ! FD_SET(sock, &input_mask); if (maxsock < sock) maxsock = sock; } --- 5752,5758 ---- goto done; } ! set_socket(sockets, sock, i); if (maxsock < sock) maxsock = sock; } *************** threadRun(void *arg) *** 5765,5771 **** /* * If no clients are ready to execute actions, sleep until we receive * data from the server, or a nap-time specified in the script ends, ! * or it's time to print a progress report. Update input_mask to show * which client(s) received data. */ if (min_usec > 0) --- 5789,5795 ---- /* * If no clients are ready to execute actions, sleep until we receive * data from the server, or a nap-time specified in the script ends, ! * or it's time to print a progress report. Update sockets to show * which client(s) received data. */ if (min_usec > 0) *************** threadRun(void *arg) *** 5776,5786 **** { if (maxsock != -1) { ! struct timeval timeout; ! ! timeout.tv_sec = min_usec / 1000000; ! timeout.tv_usec = min_usec % 1000000; ! nsocks = select(maxsock + 1, &input_mask, NULL, NULL, &timeout); } else /* nothing active, simple sleep */ { --- 5800,5806 ---- { if (maxsock != -1) { ! nsocks = wait_on_socket_set(sockets, nstate, maxsock, min_usec); } else /* nothing active, simple sleep */ { *************** threadRun(void *arg) *** 5789,5795 **** } else /* no explicit delay, select without timeout */ { ! nsocks = select(maxsock + 1, &input_mask, NULL, NULL, NULL); } if (nsocks < 0) --- 5809,5815 ---- } else /* no explicit delay, select without timeout */ { ! nsocks = wait_on_socket_set(sockets, nstate, maxsock, 0); } if (nsocks < 0) *************** threadRun(void *arg) *** 5800,5806 **** continue; } /* must be something wrong */ ! fprintf(stderr, "select() failed: %s\n", strerror(errno)); goto done; } } --- 5820,5826 ---- continue; } /* must be something wrong */ ! fprintf(stderr, "%s() failed: %s\n", SOCKET_WAIT_METHOD, strerror(errno)); goto done; } } *************** threadRun(void *arg) *** 5809,5815 **** /* min_usec == 0, i.e. something needs to be executed */ /* If we didn't call select(), don't try to read any data */ ! FD_ZERO(&input_mask); } /* ok, advance the state machine of each connection */ --- 5829,5835 ---- /* min_usec == 0, i.e. something needs to be executed */ /* If we didn't call select(), don't try to read any data */ ! clear_socket_set(sockets, nstate); } /* ok, advance the state machine of each connection */ *************** threadRun(void *arg) *** 5820,5835 **** if (st->state == CSTATE_WAIT_RESULT) { /* don't call doCustom unless data is available */ - int sock = PQsocket(st->con); ! if (sock < 0) ! { ! fprintf(stderr, "invalid socket: %s", ! PQerrorMessage(st->con)); goto done; - } ! if (!FD_ISSET(sock, &input_mask)) continue; } else if (st->state == CSTATE_FINISHED || --- 5840,5850 ---- if (st->state == CSTATE_WAIT_RESULT) { /* don't call doCustom unless data is available */ ! if (error_on_socket(sockets, i, st->con)) goto done; ! if (ignore_socket(sockets, i, st->con)) continue; } else if (st->state == CSTATE_FINISHED || *************** done: *** 5967,5972 **** --- 5982,5989 ---- fclose(thread->logfile); thread->logfile = NULL; } + free_socket_set(sockets); + sockets = NULL; return NULL; } *************** finishCon(CState *st) *** 5980,5985 **** --- 5997,6131 ---- } } + #ifdef POLL_USING_SELECT /* select(2) based socket polling */ + static socket_set * + alloc_socket_set(int count) + { + return (socket_set *) pg_malloc0(sizeof(socket_set)); + } + + static void + free_socket_set(socket_set *sa) + { + pg_free(sa); + } + + static bool + error_on_socket(socket_set *sa, int idx, PGconn *con) + { + if (PQsocket(con) >= 0) return false; + fprintf(stderr, "invalid socket: %s", PQerrorMessage(con)); + return true; + } + + static bool + ignore_socket(socket_set *sa, int idx, PGconn *con) + { + return !(FD_ISSET(PQsocket(con), sa)); + } + + static void + clear_socket_set(socket_set *sa, int count) + { + FD_ZERO(sa); + } + + static void + set_socket(socket_set *sa, int fd, int idx) + { + FD_SET(fd, sa); + } + + static int + wait_on_socket_set(socket_set *sa, int nstate, int maxsock, int64 usec) + { + struct timeval timeout; + + if (usec) + { + timeout.tv_sec = usec / 1000000; + timeout.tv_usec = usec % 1000000; + return select(maxsock + 1, sa, NULL, NULL, &timeout); + } + else + { + return select(maxsock + 1, sa, NULL, NULL, NULL); + } + } + #else /* ppoll(2) based socket polling */ + /* ppoll() will block until timeout or one of POLL_EVENTS occurs. */ + #define POLL_EVENTS (POLLRDHUP|POLLIN|POLLPRI) + /* ppoll() events returned that we do not want/expect to see. */ + #define POLL_UNWANTED (POLLRDHUP|POLLERR|POLLHUP|POLLNVAL) + + static socket_set * + alloc_socket_set(int count) + { + return (socket_set *) pg_malloc0(sizeof(socket_set) * count); + } + + static void + free_socket_set(socket_set *sa) + { + pg_free(sa); + } + + static bool + error_on_socket(socket_set *sa, int idx, PGconn *con) + { + /* + * No error if socket not used or non-error status from PQsocket() and none + * of the unwanted ppoll() return events. + */ + if (sa[idx].fd == -1 || (PQsocket(con) >= 0 && !(sa[idx].revents & POLL_UNWANTED))) + return false; + fprintf(stderr, "invalid socket: %s", PQerrorMessage(con)); + if (debug) + fprintf(stderr, "ppoll() fail - errno: %d, socket: %d, events: %x\n", + errno, sa[idx].fd, (sa[idx].revents & POLL_UNWANTED)); + return true; + } + + static bool + ignore_socket(socket_set *sa, int idx, PGconn *con) + { + return (sa[idx].fd != -1 && !sa[idx].revents); + } + + static void + clear_socket_set(socket_set *sa, int count) + { + int i = 0; + for (i = 0; i < count; i++) + set_socket(sa, -1, i); + } + + static void + set_socket(socket_set *sa, int fd, int idx) + { + sa[idx].fd = fd; + sa[idx].events = POLL_EVENTS; + sa[idx].revents = 0; + } + + static int + wait_on_socket_set(socket_set *sa, int nstate, int maxsock, int64 usec) + { + struct timespec timeout; + + if (usec) + { + timeout.tv_sec = usec / 1000000; + timeout.tv_nsec = usec % 1000000000; + return ppoll(sa, nstate, &timeout, NULL); + } + else + { + return ppoll(sa, nstate, NULL, NULL); + } + } + #endif /* PGBENCH_USE_SELECT */ + /* * Support for duration option: set timer_exceeded after so many seconds. */ diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in index 4094e22..5d40796 100644 *** a/src/include/pg_config.h.in --- b/src/include/pg_config.h.in *************** *** 443,448 **** --- 443,451 ---- /* Define to 1 if the assembler supports PPC's LWARX mutex hint bit. */ #undef HAVE_PPC_LWARX_MUTEX_HINT + /* Define to 1 if you have the `ppoll' function. */ + #undef HAVE_PPOLL + /* Define to 1 if you have the `pstat' function. */ #undef HAVE_PSTAT diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32 index 6618b43..182698a 100644 *** a/src/include/pg_config.h.win32 --- b/src/include/pg_config.h.win32 *************** *** 327,332 **** --- 327,335 ---- /* Define to 1 if you have the `posix_fallocate' function. */ /* #undef HAVE_POSIX_FALLOCATE */ + /* Define to 1 if you have the `ppoll' function. */ + /* #undef HAVE_PPOLL */ + /* Define to 1 if you have the `pstat' function. */ /* #undef HAVE_PSTAT */ diff --git a/src/template/linux b/src/template/linux index f820bf7..e392908 100644 *** a/src/template/linux --- b/src/template/linux *************** if test x"$PREFERRED_SEMAPHORES" = x"" ; *** 6,11 **** --- 6,12 ---- fi # Force _GNU_SOURCE on; plperl is broken with Perl 5.8.0 otherwise + # This is also required for ppoll(2), and perhaps other things CPPFLAGS="$CPPFLAGS -D_GNU_SOURCE" # If --enable-profiling is specified, we need -DLINUX_PROFILE
I wrote: > I'm strongly tempted to just remove the POLL_UNWANTED business > altogether, as it seems both pointless and unportable on its face. > Almost by definition, we can't know what "other" bits a given > implementation might set. > I'm not entirely following the point of including POLLRDHUP in > POLL_EVENTS, either. What's wrong with the traditional solution > of detecting EOF? So after studying that a bit longer, I think it's just wrong. It's not the business of this code to be checking for connection errors at all; that is libpq's province. The libpq API specifies that callers should wait for read-ready on the socket, and nothing else. So the only bit we need concern ourselves with is POLLIN. I also seriously disliked both the details of the abstraction API and its lack of documentation. (Other people complained about that upthread, too.) So attached is a rewrite attempt. There's still a couple of grotty things about it; in particular the ppoll variant of socket_has_input() knows more than one could wish about how it's being used. But I couldn't see a way to make it cleaner without significant changes to the logic in threadRun, and that didn't seem better. I think that Andres' concern upthread about iterating over a whole lot of sockets is somewhat misplaced. We aren't going to be iterating over the entire set of client connections, only those being run by a particular pgbench thread. So assuming you're using a reasonable ratio of threads to clients, there won't be very many to look at in any one thread. In any case, I'm dubious that we could get much of a win from some other abstraction for waiting: both of these code paths do work pretty much proportional to the number of connections the current thread is responsible for, and it's hard to see how to avoid that. I've tested this on both Linux and FreeBSD, and it seems to work fine. I'm reasonably happy with this version of the patch, and would be ready to commit it, but I thought I'd throw it out for another round of review if anyone wants to. regards, tom lane diff --git a/configure b/configure index 9b30402..21ecd29 100755 --- a/configure +++ b/configure @@ -15093,7 +15093,7 @@ fi LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` -for ac_func in cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstatpthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimeswcstombs_l +for ac_func in cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppollpstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimeswcstombs_l do : as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var" diff --git a/configure.in b/configure.in index 2e60a89..8fe6894 100644 --- a/configure.in +++ b/configure.in @@ -1562,7 +1562,7 @@ PGAC_FUNC_WCSTOMBS_L LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` -AC_CHECK_FUNCS([cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate pstatpthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimeswcstombs_l]) +AC_CHECK_FUNCS([cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppollpstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open symlink sync_file_range utime utimeswcstombs_l]) AC_REPLACE_FUNCS(fseeko) case $host_os in diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c index 41b756c..ae81aba 100644 --- a/src/bin/pgbench/pgbench.c +++ b/src/bin/pgbench/pgbench.c @@ -28,8 +28,8 @@ */ #ifdef WIN32 -#define FD_SETSIZE 1024 /* set before winsock2.h is included */ -#endif /* ! WIN32 */ +#define FD_SETSIZE 1024 /* must set before winsock2.h is included */ +#endif #include "postgres_fe.h" #include "fe_utils/conditional.h" @@ -45,12 +45,21 @@ #include <signal.h> #include <time.h> #include <sys/time.h> +#ifdef HAVE_SYS_RESOURCE_H +#include <sys/resource.h> /* for getrlimit */ +#endif + +/* For testing, PGBENCH_USE_SELECT can be defined to force use of that code */ +#if defined(HAVE_PPOLL) && !defined(PGBENCH_USE_SELECT) +#define POLL_USING_PPOLL +#ifdef HAVE_POLL_H +#include <poll.h> +#endif +#else /* no ppoll(), so use select() */ +#define POLL_USING_SELECT #ifdef HAVE_SYS_SELECT_H #include <sys/select.h> #endif - -#ifdef HAVE_SYS_RESOURCE_H -#include <sys/resource.h> /* for getrlimit */ #endif #ifndef M_PI @@ -71,6 +80,33 @@ #define MM2_ROT 47 /* + * Multi-platform socket set implementations + */ + +#ifdef POLL_USING_PPOLL +#define SOCKET_WAIT_METHOD "ppoll" + +typedef struct socket_set +{ + int maxfds; /* allocated length of pollfds[] array */ + int curfds; /* number currently in use */ + struct pollfd pollfds[FLEXIBLE_ARRAY_MEMBER]; +} socket_set; + +#endif /* POLL_USING_PPOLL */ + +#ifdef POLL_USING_SELECT +#define SOCKET_WAIT_METHOD "select" + +typedef struct socket_set +{ + int maxfd; /* largest FD currently set in fds */ + fd_set fds; +} socket_set; + +#endif /* POLL_USING_SELECT */ + +/* * Multi-platform pthread implementations */ @@ -93,13 +129,6 @@ static int pthread_join(pthread_t th, void **thread_return); /******************************************************************** * some configurable parameters */ -/* max number of clients allowed */ -#ifdef FD_SETSIZE -#define MAXCLIENTS (FD_SETSIZE - 10) -#else -#define MAXCLIENTS 1024 -#endif - #define DEFAULT_INIT_STEPS "dtgvp" /* default -I setting */ #define LOG_STEP_SECONDS 5 /* seconds between log messages */ @@ -523,8 +552,14 @@ static void processXactStats(TState *thread, CState *st, instr_time *now, static void pgbench_error(const char *fmt,...) pg_attribute_printf(1, 2); static void addScript(ParsedScript script); static void *threadRun(void *arg); -static void setalarm(int seconds); static void finishCon(CState *st); +static void setalarm(int seconds); +static socket_set *alloc_socket_set(int count); +static void free_socket_set(socket_set *sa); +static void clear_socket_set(socket_set *sa); +static void add_socket_to_set(socket_set *sa, int fd, int idx); +static int wait_on_socket_set(socket_set *sa, int64 usecs); +static bool socket_has_input(socket_set *sa, int fd, int idx); /* callback functions for our flex lexer */ @@ -4903,7 +4938,7 @@ main(int argc, char **argv) case 'c': benchmarking_option_set = true; nclients = atoi(optarg); - if (nclients <= 0 || nclients > MAXCLIENTS) + if (nclients <= 0) { fprintf(stderr, "invalid number of clients: \"%s\"\n", optarg); @@ -5606,6 +5641,7 @@ threadRun(void *arg) end; int nstate = thread->nstate; int remains = nstate; /* number of remaining clients */ + socket_set *sockets = alloc_socket_set(nstate); int i; /* for reporting progress: */ @@ -5673,14 +5709,16 @@ threadRun(void *arg) /* loop till all clients have terminated */ while (remains > 0) { - fd_set input_mask; - int maxsock; /* max socket number to be waited for */ + int nsocks; /* number of sockets to be waited for */ int64 min_usec; int64 now_usec = 0; /* set this only if needed */ - /* identify which client sockets should be checked for input */ - FD_ZERO(&input_mask); - maxsock = -1; + /* + * identify which client sockets should be checked for input, and + * compute the nearest time (if any) at which we need to wake up. + */ + clear_socket_set(sockets); + nsocks = 0; min_usec = PG_INT64_MAX; for (i = 0; i < nstate; i++) { @@ -5728,9 +5766,7 @@ threadRun(void *arg) goto done; } - FD_SET(sock, &input_mask); - if (maxsock < sock) - maxsock = sock; + add_socket_to_set(sockets, sock, nsocks++); } else if (st->state != CSTATE_ABORTED && st->state != CSTATE_FINISHED) @@ -5764,35 +5800,29 @@ threadRun(void *arg) /* * If no clients are ready to execute actions, sleep until we receive - * data from the server, or a nap-time specified in the script ends, - * or it's time to print a progress report. Update input_mask to show - * which client(s) received data. + * data on some client socket or the timeout (if any) elapses. */ if (min_usec > 0) { - int nsocks = 0; /* return from select(2) if called */ + int rc = 0; if (min_usec != PG_INT64_MAX) { - if (maxsock != -1) + if (nsocks > 0) { - struct timeval timeout; - - timeout.tv_sec = min_usec / 1000000; - timeout.tv_usec = min_usec % 1000000; - nsocks = select(maxsock + 1, &input_mask, NULL, NULL, &timeout); + rc = wait_on_socket_set(sockets, min_usec); } else /* nothing active, simple sleep */ { pg_usleep(min_usec); } } - else /* no explicit delay, select without timeout */ + else /* no explicit delay, wait without timeout */ { - nsocks = select(maxsock + 1, &input_mask, NULL, NULL, NULL); + rc = wait_on_socket_set(sockets, 0); } - if (nsocks < 0) + if (rc < 0) { if (errno == EINTR) { @@ -5800,19 +5830,20 @@ threadRun(void *arg) continue; } /* must be something wrong */ - fprintf(stderr, "select() failed: %s\n", strerror(errno)); + fprintf(stderr, "%s() failed: %s\n", SOCKET_WAIT_METHOD, strerror(errno)); goto done; } } else { - /* min_usec == 0, i.e. something needs to be executed */ + /* min_usec <= 0, i.e. something needs to be executed now */ - /* If we didn't call select(), don't try to read any data */ - FD_ZERO(&input_mask); + /* If we didn't wait, don't try to read any data */ + clear_socket_set(sockets); } /* ok, advance the state machine of each connection */ + nsocks = 0; for (i = 0; i < nstate; i++) { CState *st = &state[i]; @@ -5829,7 +5860,7 @@ threadRun(void *arg) goto done; } - if (!FD_ISSET(sock, &input_mask)) + if (!socket_has_input(sockets, sock, nsocks++)) continue; } else if (st->state == CSTATE_FINISHED || @@ -5967,6 +5998,7 @@ done: fclose(thread->logfile); thread->logfile = NULL; } + free_socket_set(sockets); return NULL; } @@ -6025,8 +6057,185 @@ setalarm(int seconds) } } +#endif /* WIN32 */ + + +/* + * These functions provide an abstraction layer that hides the syscall + * we use to wait for input on a set of sockets. + * + * Currently there are two implementations, based on ppoll(2) and select(2). + * ppoll() is preferred where available due to its typically higher ceiling + * on the number of usable sockets. We do not use the more-widely-available + * poll(2) because it only offers millisecond timeout resolution, which could + * be problematic with high --rate settings. + * + * Function APIs: + * + * alloc_socket_set: allocate an empty socket set with room for up to + * "count" sockets. + * + * free_socket_set: deallocate a socket set. + * + * clear_socket_set: reset a socket set to empty. + * + * add_socket_to_set: add socket with indicated FD to slot "idx" in the + * socket set. Slots must be filled in order, starting with 0. + * + * wait_on_socket_set: wait for input on any socket in set, or for timeout + * to expire. timeout is measured in microseconds; 0 means wait forever. + * Returns result code of underlying syscall (>=0 if OK, else see errno). + * + * socket_has_input: after waiting, call this to see if given socket has + * input. fd and idx parameters should match some previous call to + * add_socket_to_set. + * + * Note that wait_on_socket_set destructively modifies the state of the + * socket set. After checking for input, caller must apply clear_socket_set + * and add_socket_to_set again before waiting again. + */ + +#ifdef POLL_USING_PPOLL + +static socket_set * +alloc_socket_set(int count) +{ + socket_set *sa; + + sa = (socket_set *) pg_malloc0(offsetof(socket_set, pollfds) + + sizeof(struct pollfd) * count); + sa->maxfds = count; + sa->curfds = 0; + return sa; +} + +static void +free_socket_set(socket_set *sa) +{ + pg_free(sa); +} + +static void +clear_socket_set(socket_set *sa) +{ + sa->curfds = 0; +} + +static void +add_socket_to_set(socket_set *sa, int fd, int idx) +{ + Assert(idx < sa->maxfds && idx == sa->curfds); + sa->pollfds[idx].fd = fd; + sa->pollfds[idx].events = POLLIN; + sa->pollfds[idx].revents = 0; + sa->curfds++; +} + +static int +wait_on_socket_set(socket_set *sa, int64 usecs) +{ + if (usecs > 0) + { + struct timespec timeout; + + timeout.tv_sec = usecs / 1000000; + timeout.tv_nsec = (usecs % 1000000) * 1000; + return ppoll(sa->pollfds, sa->curfds, &timeout, NULL); + } + else + { + return ppoll(sa->pollfds, sa->curfds, NULL, NULL); + } +} + +static bool +socket_has_input(socket_set *sa, int fd, int idx) +{ + /* + * In some cases, threadRun will apply clear_socket_set and then try to + * apply socket_has_input anyway with arguments that it used before that, + * or might've used before that except that it exited its setup loop + * early. Hence, if the socket set is empty, silently return false + * regardless of the parameters. If it's not empty, we can Assert that + * the parameters match a previous call. + */ + if (sa->curfds == 0) + return false; + + Assert(idx < sa->curfds && sa->pollfds[idx].fd == fd); + return (sa->pollfds[idx].revents & POLLIN) != 0; +} + +#endif /* POLL_USING_PPOLL */ + +#ifdef POLL_USING_SELECT + +static socket_set * +alloc_socket_set(int count) +{ + return (socket_set *) pg_malloc0(sizeof(socket_set)); +} + +static void +free_socket_set(socket_set *sa) +{ + pg_free(sa); +} + +static void +clear_socket_set(socket_set *sa) +{ + FD_ZERO(&sa->fds); + sa->maxfd = -1; +} + +static void +add_socket_to_set(socket_set *sa, int fd, int idx) +{ + if (fd < 0 || fd >= FD_SETSIZE) + { + /* + * Doing a hard exit here is a bit grotty, but it doesn't seem worth + * complicating the API to make it less grotty. + */ + fprintf(stderr, "too many client connections for select()\n"); + exit(1); + } + FD_SET(fd, &sa->fds); + if (fd > sa->maxfd) + sa->maxfd = fd; +} + +static int +wait_on_socket_set(socket_set *sa, int64 usecs) +{ + if (usecs > 0) + { + struct timeval timeout; + + timeout.tv_sec = usecs / 1000000; + timeout.tv_usec = usecs % 1000000; + return select(sa->maxfd + 1, &sa->fds, NULL, NULL, &timeout); + } + else + { + return select(sa->maxfd + 1, &sa->fds, NULL, NULL, NULL); + } +} + +static bool +socket_has_input(socket_set *sa, int fd, int idx) +{ + return (FD_ISSET(fd, &sa->fds) != 0); +} + +#endif /* POLL_USING_SELECT */ + + /* partial pthread implementation for Windows */ +#ifdef WIN32 + typedef struct win32_pthread { HANDLE handle; diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in index 4094e22..5d40796 100644 --- a/src/include/pg_config.h.in +++ b/src/include/pg_config.h.in @@ -443,6 +443,9 @@ /* Define to 1 if the assembler supports PPC's LWARX mutex hint bit. */ #undef HAVE_PPC_LWARX_MUTEX_HINT +/* Define to 1 if you have the `ppoll' function. */ +#undef HAVE_PPOLL + /* Define to 1 if you have the `pstat' function. */ #undef HAVE_PSTAT diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32 index 6618b43..182698a 100644 --- a/src/include/pg_config.h.win32 +++ b/src/include/pg_config.h.win32 @@ -327,6 +327,9 @@ /* Define to 1 if you have the `posix_fallocate' function. */ /* #undef HAVE_POSIX_FALLOCATE */ +/* Define to 1 if you have the `ppoll' function. */ +/* #undef HAVE_PPOLL */ + /* Define to 1 if you have the `pstat' function. */ /* #undef HAVE_PSTAT */ diff --git a/src/template/linux b/src/template/linux index f820bf7..e392908 100644 --- a/src/template/linux +++ b/src/template/linux @@ -6,6 +6,7 @@ if test x"$PREFERRED_SEMAPHORES" = x"" ; then fi # Force _GNU_SOURCE on; plperl is broken with Perl 5.8.0 otherwise +# This is also required for ppoll(2), and perhaps other things CPPFLAGS="$CPPFLAGS -D_GNU_SOURCE" # If --enable-profiling is specified, we need -DLINUX_PROFILE