Thread: psql not responding to SIGINT upon db reconnection
Neon provides a quick start mechanism for psql using the following workflow: $ psql -h pg.neon.tech NOTICE: Welcome to Neon! Authenticate by visiting: https://console.neon.tech/psql_session/xxx Upon navigating to the link, the user selects their database to work with. The psql process is then connected to a Postgres database at Neon. psql provides a way to reconnect to the database from within itself: \c. If, after connecting to a database such as the process previously mentioned, the user starts a reconnection to the database from within psql, the process will refuse to interrupt the reconnection attempt after sending SIGINT to it. tristan=> \c NOTICE: Welcome to Neon! Authenticate by visiting: https://console.neon.tech/psql_session/yyy ^C ^C^C ^C^C I am not really sure if this was a problem on Windows machines, but the attached patch should support it if it does. If this was never a problem on Windows, it might be best to check for any regressions. This was originally discussed in the a GitHub issue[0] at Neon. [0]: https://github.com/neondatabase/neon/issues/1449 -- Tristan Partin Neon (https://neon.tech)
Attachment
On Mon, Jul 24, 2023 at 9:26 AM Tristan Partin <tristan@neon.tech> wrote: > attached patch + /* + * Restore the default SIGINT behavior while within libpq. Otherwise, we + * can never exit from polling for the database connection. Failure to + * restore is non-fatal. + */ + newact.sa_handler = SIG_DFL; + rc = sigaction(SIGINT, &newact, &oldact); There's no action taken if rc != 0. It doesn't seem right to continue as if everything's fine when the handler registration fails. At least a warning is warranted, so that the user reports such failures to the community. Best regards, Gurjeet http://Gurje.et
On Mon Jul 24, 2023 at 12:00 PM CDT, Gurjeet Singh wrote: > On Mon, Jul 24, 2023 at 9:26 AM Tristan Partin <tristan@neon.tech> wrote: > > > attached patch > > + /* > + * Restore the default SIGINT behavior while within libpq. > Otherwise, we > + * can never exit from polling for the database connection. Failure to > + * restore is non-fatal. > + */ > + newact.sa_handler = SIG_DFL; > + rc = sigaction(SIGINT, &newact, &oldact); > > There's no action taken if rc != 0. It doesn't seem right to > continue as if everything's fine when the handler registration fails. > At least a warning is warranted, so that the user reports such > failures to the community. If we fail to go back to the default handler, then we just get the behavior we currently have. I am not sure logging a message like "Failed to restore default SIGINT handler" is that useful to the user. It isn't actionable, and at the end of the day isn't going to affect them for the most part. They also aren't even aware that default handler was ever overridden in the first place. I'm more than happy to add a debug log if it is the blocker to getting this patch accepted however. -- Tristan Partin Neon (https://neon.tech)
v2 is attached which fixes a grammatical issue in a comment. -- Tristan Partin Neon (https://neon.tech)
Attachment
v3 is attached which fixes up some code comments I added which I hadn't attached to the commit already, sigh. -- Tristan Partin Neon (https://neon.tech)
Attachment
"Tristan Partin" <tristan@neon.tech> writes: > v3 is attached which fixes up some code comments I added which I hadn't > attached to the commit already, sigh. I don't care for this patch at all. You're bypassing the pqsignal abstraction layer that the rest of psql goes through, and the behavior you're implementing isn't very nice. People do not expect ^C to kill psql - it should just stop the \c attempt and leave you as you were. Admittedly, getting PQconnectdbParams to return control on SIGINT isn't too practical. But you could probably replace that with a loop around PQconnectPoll and test for CancelRequested in the loop. regards, tom lane
On Mon Jul 24, 2023 at 12:43 PM CDT, Tom Lane wrote: > "Tristan Partin" <tristan@neon.tech> writes: > > v3 is attached which fixes up some code comments I added which I hadn't > > attached to the commit already, sigh. > > I don't care for this patch at all. You're bypassing the pqsignal > abstraction layer that the rest of psql goes through, and the behavior > you're implementing isn't very nice. People do not expect ^C to > kill psql - it should just stop the \c attempt and leave you as you > were. > > Admittedly, getting PQconnectdbParams to return control on SIGINT > isn't too practical. But you could probably replace that with a loop > around PQconnectPoll and test for CancelRequested in the loop. That sounds like a much better solution. Attached you will find a v4 that implements your suggestion. Please let me know if there is something that I missed. I can confirm that the patch works. $ ./build/src/bin/psql/psql -h pg.neon.tech NOTICE: Welcome to Neon! Authenticate by visiting: https://console.neon.tech/psql_session/xxx NOTICE: Connecting to database. psql (17devel, server 15.3) SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, compression: off) Type "help" for help. tristan957=> \c NOTICE: Welcome to Neon! Authenticate by visiting: https://console.neon.tech/psql_session/yyy ^Cconnection to server at "pg.neon.tech" (3.18.6.96), port 5432 failed: Previous connection kept tristan957=> -- Tristan Partin Neon (https://neon.tech)
Attachment
Hi, > That sounds like a much better solution. Attached you will find a v4 > that implements your suggestion. Please let me know if there is > something that I missed. I can confirm that the patch works. > > $ ./build/src/bin/psql/psql -h pg.neon.tech > NOTICE: Welcome to Neon! > Authenticate by visiting: > https://console.neon.tech/psql_session/xxx > > > NOTICE: Connecting to database. > psql (17devel, server 15.3) > SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, compression: off) > Type "help" for help. > > tristan957=> \c > NOTICE: Welcome to Neon! > Authenticate by visiting: > https://console.neon.tech/psql_session/yyy > > > ^Cconnection to server at "pg.neon.tech" (3.18.6.96), port 5432 failed: > Previous connection kept > tristan957=> I went through CFbot and found out that some tests are timing out. Links: https://cirrus-ci.com/task/6735437444677632 https://cirrus-ci.com/task/4536414189125632 https://cirrus-ci.com/task/5046587584413696 https://cirrus-ci.com/task/6172487491256320 https://cirrus-ci.com/task/5609537537835008 Some tests which are timing out are as follows: [00:48:49.243] Summary of Failures: [00:48:49.243] [00:48:49.243] 4/270 postgresql:regress / regress/regress TIMEOUT 1000.01s [00:48:49.243] 6/270 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade TIMEOUT 1000.02s [00:48:49.243] 34/270 postgresql:recovery / recovery/027_stream_regress TIMEOUT 1000.02s [00:48:49.243] 48/270 postgresql:plpgsql / plpgsql/regress TIMEOUT 1000.02s [00:48:49.243] 49/270 postgresql:plperl / plperl/regress TIMEOUT 1000.01s [00:48:49.243] 61/270 postgresql:dblink / dblink/regress TIMEOUT 1000.03s [00:48:49.243] 89/270 postgresql:postgres_fdw / postgres_fdw/regress TIMEOUT 1000.01s [00:48:49.243] 93/270 postgresql:test_decoding / test_decoding/regress TIMEOUT 1000.02s [00:48:49.243] 110/270 postgresql:test_extensions / test_extensions/regress TIMEOUT 1000.03s [00:48:49.243] 123/270 postgresql:unsafe_tests / unsafe_tests/regress TIMEOUT 1000.02s [00:48:49.243] 152/270 postgresql:pg_dump / pg_dump/010_dump_connstr TIMEOUT 1000.03s Just want to make sure you are aware of the issue. Thanks Shlok Kumar Kyal
On Thu Nov 2, 2023 at 4:03 AM CDT, Shlok Kyal wrote: > Hi, > > > That sounds like a much better solution. Attached you will find a v4 > > that implements your suggestion. Please let me know if there is > > something that I missed. I can confirm that the patch works. > > > > $ ./build/src/bin/psql/psql -h pg.neon.tech > > NOTICE: Welcome to Neon! > > Authenticate by visiting: > > https://console.neon.tech/psql_session/xxx > > > > > > NOTICE: Connecting to database. > > psql (17devel, server 15.3) > > SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, compression: off) > > Type "help" for help. > > > > tristan957=> \c > > NOTICE: Welcome to Neon! > > Authenticate by visiting: > > https://console.neon.tech/psql_session/yyy > > > > > > ^Cconnection to server at "pg.neon.tech" (3.18.6.96), port 5432 failed: > > Previous connection kept > > tristan957=> > > I went through CFbot and found out that some tests are timing out. > Links: > https://cirrus-ci.com/task/6735437444677632 > https://cirrus-ci.com/task/4536414189125632 > https://cirrus-ci.com/task/5046587584413696 > https://cirrus-ci.com/task/6172487491256320 > https://cirrus-ci.com/task/5609537537835008 > > Some tests which are timing out are as follows: > [00:48:49.243] Summary of Failures: > [00:48:49.243] > [00:48:49.243] 4/270 postgresql:regress / regress/regress TIMEOUT 1000.01s > [00:48:49.243] 6/270 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade > TIMEOUT 1000.02s > [00:48:49.243] 34/270 postgresql:recovery / > recovery/027_stream_regress TIMEOUT 1000.02s > [00:48:49.243] 48/270 postgresql:plpgsql / plpgsql/regress TIMEOUT 1000.02s > [00:48:49.243] 49/270 postgresql:plperl / plperl/regress TIMEOUT 1000.01s > [00:48:49.243] 61/270 postgresql:dblink / dblink/regress TIMEOUT 1000.03s > [00:48:49.243] 89/270 postgresql:postgres_fdw / postgres_fdw/regress > TIMEOUT 1000.01s > [00:48:49.243] 93/270 postgresql:test_decoding / test_decoding/regress > TIMEOUT 1000.02s > [00:48:49.243] 110/270 postgresql:test_extensions / > test_extensions/regress TIMEOUT 1000.03s > [00:48:49.243] 123/270 postgresql:unsafe_tests / unsafe_tests/regress > TIMEOUT 1000.02s > [00:48:49.243] 152/270 postgresql:pg_dump / pg_dump/010_dump_connstr > TIMEOUT 1000.03s > > Just want to make sure you are aware of the issue. Hi Shlok! Thanks for taking a look. I will check these failures out to see if I can reproduce. -- Tristan Partin Neon (https://neon.tech)
On 06/11/2023 19:16, Tristan Partin wrote: >>> That sounds like a much better solution. Attached you will find a v4 >>> that implements your suggestion. Please let me know if there is >>> something that I missed. I can confirm that the patch works. This patch is missing a select(). It will busy loop until the connection is established or cancelled. Shouldn't we also clear CancelRequested after we have cancelled the attempt? Otherwise, any subsequent attempts will immediately fail too. Should we use 'cancel_pressed' here rather than CancelRequested? To be honest, I don't understand the difference, so that's a genuine question. There was an attempt at unifying them in the past but it was reverted in commit 5d43c3c54d. -- Heikki Linnakangas Neon (https://neon.tech)
On Thu Nov 16, 2023 at 8:33 AM CST, Heikki Linnakangas wrote: > On 06/11/2023 19:16, Tristan Partin wrote: > >>> That sounds like a much better solution. Attached you will find a v4 > >>> that implements your suggestion. Please let me know if there is > >>> something that I missed. I can confirm that the patch works. > > This patch is missing a select(). It will busy loop until the connection > is established or cancelled. If I add a wait (select, poll, etc.), then I can't control-C during the blocking call, so it doesn't really solve the problem. On Linux, we have signalfds which seem like the perfect solution to this problem, "wait on the pq socket or SIGINT." But that doesn't translate well to other operating systems :(. > tristan957=> \c > NOTICE: Welcome to Neon! > Authenticate by visiting: > https://console.neon.tech/psql_session/XXXX > > > ^CTerminated You can see here that I can't terminate the command. Where you see "Terminated" is me running `kill $(pgrep psql)` in another terminal. > Shouldn't we also clear CancelRequested after we have cancelled the > attempt? Otherwise, any subsequent attempts will immediately fail too. After switching to cancel_pressed, I don't think so. See this bit of code in the psql main loop: > /* main loop to get queries and execute them */ > while (successResult == EXIT_SUCCESS) > { > /* > * Clean up after a previous Control-C > */ > if (cancel_pressed) > { > if (!pset.cur_cmd_interactive) > { > /* > * You get here if you stopped a script with Ctrl-C. > */ > successResult = EXIT_USER; > break; > } > > cancel_pressed = false; > } > Should we use 'cancel_pressed' here rather than CancelRequested? To be > honest, I don't understand the difference, so that's a genuine question. > There was an attempt at unifying them in the past but it was reverted in > commit 5d43c3c54d. The more I look at this, the more I don't understand... I need to look at this harder, but wanted to get this email out. Switched to cancel_pressed though. Thanks for pointing it out. Going to wait to send a v2 while more discussion occurs. -- Tristan Partin Neon (https://neon.tech)
On 22/11/2023 19:29, Tristan Partin wrote: > On Thu Nov 16, 2023 at 8:33 AM CST, Heikki Linnakangas wrote: >> On 06/11/2023 19:16, Tristan Partin wrote: >>>>> That sounds like a much better solution. Attached you will find a v4 >>>>> that implements your suggestion. Please let me know if there is >>>>> something that I missed. I can confirm that the patch works. >> >> This patch is missing a select(). It will busy loop until the connection >> is established or cancelled. > > If I add a wait (select, poll, etc.), then I can't control-C during the > blocking call, so it doesn't really solve the problem. Hmm, they should return with EINTR on signal. At least on Linux; I'm not sure how portable that is. See signal(7) man page, section "Interruption of system calls and library functions by signal handlers". You could also use a timeout like 5 s to ensure that you wake up and notice that the signal was received eventually, even if it doesn't interrupt the blocking call. -- Heikki Linnakangas Neon (https://neon.tech)
On Wed Nov 22, 2023 at 3:00 PM CST, Heikki Linnakangas wrote: > On 22/11/2023 19:29, Tristan Partin wrote: > > On Thu Nov 16, 2023 at 8:33 AM CST, Heikki Linnakangas wrote: > >> On 06/11/2023 19:16, Tristan Partin wrote: > >>>>> That sounds like a much better solution. Attached you will find a v4 > >>>>> that implements your suggestion. Please let me know if there is > >>>>> something that I missed. I can confirm that the patch works. > >> > >> This patch is missing a select(). It will busy loop until the connection > >> is established or cancelled. > > > > If I add a wait (select, poll, etc.), then I can't control-C during the > > blocking call, so it doesn't really solve the problem. > > Hmm, they should return with EINTR on signal. At least on Linux; I'm not > sure how portable that is. See signal(7) man page, section "Interruption > of system calls and library functions by signal handlers". You could > also use a timeout like 5 s to ensure that you wake up and notice that > the signal was received eventually, even if it doesn't interrupt the > blocking call. Ha, you're right. I had this working yesterday, but convinced myself it didn't. I had a do while loop wrapping the blocking call. Here is a v4, which seems to pass the tests that were pointed out to be failing earlier. Noticed that I copy-pasted pqSocketPoll() into the psql code. I think this may be controversial. Not sure what the best solution to the issue is. I will pay attention to the buildfarm animals when they pick this up. -- Tristan Partin Neon (https://neon.tech)
Attachment
On 22/11/2023 23:29, Tristan Partin wrote: > Ha, you're right. I had this working yesterday, but convinced myself it > didn't. I had a do while loop wrapping the blocking call. Here is a v4, > which seems to pass the tests that were pointed out to be failing > earlier. Thanks! This suffers from a classic race condition: > + if (cancel_pressed) > + break; > + > + sock = PQsocket(n_conn); > + if (sock == -1) > + break; > + > + rc = pqSocketPoll(sock, for_read, !for_read, (time_t)-1); > + Assert(rc != 0); /* Timeout is impossible. */ > + if (rc == -1) > + { > + success = false; > + break; > + } If a signal arrives between the "if (cancel_pressed)" check and pqSocketPoll(), pgSocketPoll will "miss" the signal and block indefinitely. There are three solutions to this: 1. Use a timeout, so that you wake up periodically to check for any missed signals. Easy solution, the downsides are that you will not react as quickly if the signal is missed, and you waste some cycles by waking up unnecessarily. 2. The Self-pipe trick: https://cr.yp.to/docs/selfpipe.html. We also use that in src/backend/storage/ipc/latch.c. It's portable, but somewhat complicated. 3. Use pselect() or ppoll(). They were created specifically to address this problem. Not fully portable, I think it's missing on Windows at least. Maybe use pselect() if it's available, and select() with a timeout if it's not. -- Heikki Linnakangas Neon (https://neon.tech)
On Thu Nov 23, 2023 at 3:19 AM CST, Heikki Linnakangas wrote: > On 22/11/2023 23:29, Tristan Partin wrote: > > Ha, you're right. I had this working yesterday, but convinced myself it > > didn't. I had a do while loop wrapping the blocking call. Here is a v4, > > which seems to pass the tests that were pointed out to be failing > > earlier. > > Thanks! This suffers from a classic race condition: > > > + if (cancel_pressed) > > + break; > > + > > + sock = PQsocket(n_conn); > > + if (sock == -1) > > + break; > > + > > + rc = pqSocketPoll(sock, for_read, !for_read, (time_t)-1); > > + Assert(rc != 0); /* Timeout is impossible. */ > > + if (rc == -1) > > + { > > + success = false; > > + break; > > + } > > If a signal arrives between the "if (cancel_pressed)" check and > pqSocketPoll(), pgSocketPoll will "miss" the signal and block > indefinitely. There are three solutions to this: > > 1. Use a timeout, so that you wake up periodically to check for any > missed signals. Easy solution, the downsides are that you will not react > as quickly if the signal is missed, and you waste some cycles by waking > up unnecessarily. > > 2. The Self-pipe trick: https://cr.yp.to/docs/selfpipe.html. We also use > that in src/backend/storage/ipc/latch.c. It's portable, but somewhat > complicated. > > 3. Use pselect() or ppoll(). They were created specifically to address > this problem. Not fully portable, I think it's missing on Windows at least. > > Maybe use pselect() if it's available, and select() with a timeout if > it's not. First I have ever heard of this race condition, and now I will commit it to durable storage :). I went ahead and did option #3 that you proposed. On Windows, I have a 1 second timeout for the select/poll. That seemed like a good balance of user experience and spinning the CPU. But I am open to other values. I don't have a Windows VM, but maybe I should set one up... I am not completely in love with the code I have written. Lots of conditional compilation which makes it hard to read. Looking forward to another round of review to see what y'all think. For what it's worth, I did try #2, but since psql installs its own SIGINT handler, the code became really hairy. -- Tristan Partin Neon (https://neon.tech)
Attachment
On Wed Nov 29, 2023 at 11:48 AM CST, Tristan Partin wrote: > I am not completely in love with the code I have written. Lots of > conditional compilation which makes it hard to read. Looking forward to > another round of review to see what y'all think. Ok. Here is a patch which just uses select(2) with a timeout of 1s or pselect(2) if it is available. I also moved the state machine processing into its own function. Thanks for your comments thus far. -- Tristan Partin Neon (https://neon.tech)
Attachment
On Tue, Dec 5, 2023 at 1:35 PM Tristan Partin <tristan@neon.tech> wrote: > On Wed Nov 29, 2023 at 11:48 AM CST, Tristan Partin wrote: > > I am not completely in love with the code I have written. Lots of > > conditional compilation which makes it hard to read. Looking forward to > > another round of review to see what y'all think. > > Ok. Here is a patch which just uses select(2) with a timeout of 1s or > pselect(2) if it is available. I also moved the state machine processing > into its own function. Hmm, this adds a function called pqSocketPoll to psql/command.c. But there already is such a function in libpq/fe-misc.c. It's not quite the same, though. Having the same function in two different modules with subtly different definitions seems like it's probably not the right idea. Also, this seems awfully complicated for something that's supposed to (checks notes) wait for a file descriptor to become ready for I/O for up to 1 second. It's 160 lines of code in pqSocketPoll and another 50 in the caller. If this is really the simplest way to do this, we really need to rethink our libpq APIs or, uh, something. I wonder if we could make this simpler by, say: - always use select - always use a 1-second timeout - if it returns faster because the race condition doesn't happen, cool - if it doesn't, well then you get to wait for a second, oh well I don't feel strongly that that's the right way to go and if Heikki or some other committer wants to go with this more complex conditional approach, that's fine. But to me it seems like a lot. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri Jan 5, 2024 at 12:24 PM CST, Robert Haas wrote: > On Tue, Dec 5, 2023 at 1:35 PM Tristan Partin <tristan@neon.tech> wrote: > > On Wed Nov 29, 2023 at 11:48 AM CST, Tristan Partin wrote: > > > I am not completely in love with the code I have written. Lots of > > > conditional compilation which makes it hard to read. Looking forward to > > > another round of review to see what y'all think. > > > > Ok. Here is a patch which just uses select(2) with a timeout of 1s or > > pselect(2) if it is available. I also moved the state machine processing > > into its own function. > > Hmm, this adds a function called pqSocketPoll to psql/command.c. But > there already is such a function in libpq/fe-misc.c. It's not quite > the same, though. Having the same function in two different modules > with subtly different definitions seems like it's probably not the > right idea. Yep, not tied to the function name. Happy to rename as anyone suggests. > Also, this seems awfully complicated for something that's supposed to > (checks notes) wait for a file descriptor to become ready for I/O for > up to 1 second. It's 160 lines of code in pqSocketPoll and another 50 > in the caller. If this is really the simplest way to do this, we > really need to rethink our libpq APIs or, uh, something. I wonder if > we could make this simpler by, say: > > - always use select > - always use a 1-second timeout > - if it returns faster because the race condition doesn't happen, cool > - if it doesn't, well then you get to wait for a second, oh well > > I don't feel strongly that that's the right way to go and if Heikki or > some other committer wants to go with this more complex conditional > approach, that's fine. But to me it seems like a lot. I think the way to go is to expose some variation of libpq's pqSocketPoll(), which I would be happy to put together a patch for. Making frontends, psql in this case, have to reimplement the polling logic doesn't strike me as fruitful, which is essentially what I have done. Thanks for your input! But also wait a second. In my last email, I said: > Ok. Here is a patch which just uses select(2) with a timeout of 1s or > pselect(2) if it is available. I also moved the state machine processing > into its own function. This is definitely not the patch I meant to send. What the? Here is what I meant to send, but I stand by my comment that we should just expose a variation of pqSocketPoll(). -- Tristan Partin Neon (https://neon.tech)
Attachment
On Mon, Jan 8, 2024 at 1:03 AM Tristan Partin <tristan@neon.tech> wrote: > I think the way to go is to expose some variation of libpq's > pqSocketPoll(), which I would be happy to put together a patch for. > Making frontends, psql in this case, have to reimplement the polling > logic doesn't strike me as fruitful, which is essentially what I have > done. I encourage further exploration of this line of attack. I fear that if I were to commit something like what you've posted up until now, people would complain that that code was too ugly to live, and I'd have a hard time telling them that they're wrong. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri Jan 12, 2024 at 10:45 AM CST, Robert Haas wrote: > On Mon, Jan 8, 2024 at 1:03 AM Tristan Partin <tristan@neon.tech> wrote: > > I think the way to go is to expose some variation of libpq's > > pqSocketPoll(), which I would be happy to put together a patch for. > > Making frontends, psql in this case, have to reimplement the polling > > logic doesn't strike me as fruitful, which is essentially what I have > > done. > > I encourage further exploration of this line of attack. I fear that if > I were to commit something like what you've posted up until now, > people would complain that that code was too ugly to live, and I'd > have a hard time telling them that they're wrong. Completely agree. Let me look into this. Perhaps I can get something up next week or the week after. -- Tristan Partin Neon (https://neon.tech)
On Fri Jan 12, 2024 at 11:13 AM CST, Tristan Partin wrote: > On Fri Jan 12, 2024 at 10:45 AM CST, Robert Haas wrote: > > On Mon, Jan 8, 2024 at 1:03 AM Tristan Partin <tristan@neon.tech> wrote: > > > I think the way to go is to expose some variation of libpq's > > > pqSocketPoll(), which I would be happy to put together a patch for. > > > Making frontends, psql in this case, have to reimplement the polling > > > logic doesn't strike me as fruitful, which is essentially what I have > > > done. > > > > I encourage further exploration of this line of attack. I fear that if > > I were to commit something like what you've posted up until now, > > people would complain that that code was too ugly to live, and I'd > > have a hard time telling them that they're wrong. > > Completely agree. Let me look into this. Perhaps I can get something up > next week or the week after. Not next week, but here is a respin. I've exposed pqSocketPoll as PQsocketPoll and am just using that. You can see the diff is so much smaller, which is great! In order to fight the race condition, I am just using a 1 second timeout instead of trying to integrate pselect or ppoll. We could add a PQsocketPPoll() to support those use cases, but I am not sure how available pselect and ppoll are. I guess on Windows we don't have pselect. I don't think using the pipe trick that Heikki mentioned earlier is suitable to expose via an API in libpq, but someone else might have a different opinion. Maybe this is good enough until someone complains? Most people would probably just chalk any latency between keypress and cancellation as network latency and not a hardcoded 1 second. Thanks for your feedback Robert! -- Tristan Partin Neon (https://neon.tech)
Attachment
On Tue, 30 Jan 2024 at 23:20, Tristan Partin <tristan@neon.tech> wrote: > Not next week, but here is a respin. I've exposed pqSocketPoll as > PQsocketPoll and am just using that. You can see the diff is so much > smaller, which is great! The exports.txt change should be made part of patch 0001, also docs are missing for the newly exposed function. PQsocketPoll does look like quite a nice API to expose from libpq.
On Tue Jan 30, 2024 at 4:42 PM CST, Jelte Fennema-Nio wrote: > On Tue, 30 Jan 2024 at 23:20, Tristan Partin <tristan@neon.tech> wrote: > > Not next week, but here is a respin. I've exposed pqSocketPoll as > > PQsocketPoll and am just using that. You can see the diff is so much > > smaller, which is great! > > The exports.txt change should be made part of patch 0001, also docs > are missing for the newly exposed function. PQsocketPoll does look > like quite a nice API to expose from libpq. I was looking for documentation of PQsocket(), but didn't find any standalone (unless I completely missed it). So I just copied how PQsocket() is documented in PQconnectPoll(). I am happy to document it separately if you think it would be useful. My bad on the exports.txt change being in the wrong commit. Git things... I will fix it on the next re-spin after resolving the previous paragraph. -- Tristan Partin Neon (https://neon.tech)
On Wed, 31 Jan 2024 at 19:07, Tristan Partin <tristan@neon.tech> wrote: > I was looking for documentation of PQsocket(), but didn't find any > standalone (unless I completely missed it). So I just copied how > PQsocket() is documented in PQconnectPoll(). I am happy to document it > separately if you think it would be useful. PQsocket its documentation is here: https://www.postgresql.org/docs/16/libpq-status.html#LIBPQ-PQSOCKET I think PQsocketPoll should probably have its API documentation (describing arguments and return value at minimum) in this section of the docs: https://www.postgresql.org/docs/16/libpq-async.html And I think the example in the PQconnnectPoll API docs could benefit from having PQsocketPoll used in that example.
On Wed, Jan 31, 2024 at 1:07 PM Tristan Partin <tristan@neon.tech> wrote: > I was looking for documentation of PQsocket(), but didn't find any > standalone (unless I completely missed it). So I just copied how > PQsocket() is documented in PQconnectPoll(). I am happy to document it > separately if you think it would be useful. As Jelte said back at the end of January, I think you just completely missed it. The relevant part of libpq.sgml starts like this: <varlistentry id="libpq-PQsocket"> <term><function>PQsocket</function><indexterm><primary>PQsocket</primary></indexterm></term> As far as I know, we document all of the exported libpq functions in that SGML file. I think there's no real reason why we couldn't get at least 0001 and maybe also 0002 into this release, but only if you move quickly on this. Feature freeze is approaching rapidly. Modulo the documentation changes, I think 0001 is pretty much ready to go. 0002 needs comments. I'm also not so sure about the name process_connection_state_machine(); it seems a little verbose. How about something like wait_until_connected()? And maybe put it below do_connect() instead of above. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri Mar 22, 2024 at 9:59 AM CDT, Robert Haas wrote: > On Wed, Jan 31, 2024 at 1:07 PM Tristan Partin <tristan@neon.tech> wrote: > > I was looking for documentation of PQsocket(), but didn't find any > > standalone (unless I completely missed it). So I just copied how > > PQsocket() is documented in PQconnectPoll(). I am happy to document it > > separately if you think it would be useful. > > As Jelte said back at the end of January, I think you just completely > missed it. The relevant part of libpq.sgml starts like this: > > <varlistentry id="libpq-PQsocket"> > <term><function>PQsocket</function><indexterm><primary>PQsocket</primary></indexterm></term> > > As far as I know, we document all of the exported libpq functions in > that SGML file. > > I think there's no real reason why we couldn't get at least 0001 and > maybe also 0002 into this release, but only if you move quickly on > this. Feature freeze is approaching rapidly. > > Modulo the documentation changes, I think 0001 is pretty much ready to > go. 0002 needs comments. I'm also not so sure about the name > process_connection_state_machine(); it seems a little verbose. How > about something like wait_until_connected()? And maybe put it below > do_connect() instead of above. Robert, Jelte: Sorry for taking a while to get back to y'all. I have taken your feedback into consideration for v9. This is my first time writing Postgres docs, so I'm ready for another round of editing :). Robert, could you point out some places where comments would be useful in 0002? I did rename the function and moved it as suggested, thanks! In turn, I also realized I forgot a prototype, so also added it. This patchset has also been rebased on master, so the libpq export number for PQsocketPoll was bumped. -- Tristan Partin Neon (https://neon.tech)
Attachment
On Fri, Mar 22, 2024 at 1:05 PM Tristan Partin <tristan@neon.tech> wrote: > Sorry for taking a while to get back to y'all. I have taken your > feedback into consideration for v9. This is my first time writing > Postgres docs, so I'm ready for another round of editing :). Yeah, that looks like it needs some wordsmithing yet. I can take a crack at that at some point, but here are a few notes: - "takes care of" and "working through the state machine" seem quite vague to me. - the meanings of forRead and forWrite don't seem to be documented. - "retuns" is a typo. > Robert, could you point out some places where comments would be useful > in 0002? I did rename the function and moved it as suggested, thanks! In > turn, I also realized I forgot a prototype, so also added it. Well, for starters, I'd give the function a header comment. Telling me that a 1 second timeout is a 1 second timeout is less useful than telling me why we've picked a 1 second timeout. Presumably the answer is "so we can respond to cancel interrupts in a timely fashion", but I'd make that explicit. It might be worth a comment saying that PQsocket() shouldn't be hoisted out of the loop because it could be a multi-host connection string and the socket might change under us. Unless that's not true, in which case we should hoist the PQsocket() call out of the loop. I think it would also be worth a comment saying that we don't need to report errors here, as the caller will do that; we just need to wait until the connection is known to have either succeeded or failed, or until the user presses cancel. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri Mar 22, 2024 at 12:17 PM CDT, Robert Haas wrote: > On Fri, Mar 22, 2024 at 1:05 PM Tristan Partin <tristan@neon.tech> wrote: > > Sorry for taking a while to get back to y'all. I have taken your > > feedback into consideration for v9. This is my first time writing > > Postgres docs, so I'm ready for another round of editing :). > > Yeah, that looks like it needs some wordsmithing yet. I can take a > crack at that at some point, but here are a few notes: > > - "takes care of" and "working through the state machine" seem quite > vague to me. > - the meanings of forRead and forWrite don't seem to be documented. > - "retuns" is a typo. > > > Robert, could you point out some places where comments would be useful > > in 0002? I did rename the function and moved it as suggested, thanks! In > > turn, I also realized I forgot a prototype, so also added it. > > Well, for starters, I'd give the function a header comment. > > Telling me that a 1 second timeout is a 1 second timeout is less > useful than telling me why we've picked a 1 second timeout. Presumably > the answer is "so we can respond to cancel interrupts in a timely > fashion", but I'd make that explicit. > > It might be worth a comment saying that PQsocket() shouldn't be > hoisted out of the loop because it could be a multi-host connection > string and the socket might change under us. Unless that's not true, > in which case we should hoist the PQsocket() call out of the loop. > > I think it would also be worth a comment saying that we don't need to > report errors here, as the caller will do that; we just need to wait > until the connection is known to have either succeeded or failed, or > until the user presses cancel. This is good feedback, thanks. I have added comments where you suggested. I reworded the PQsocketPoll docs to hopefully meet your expectations. I am using the term "connection sequence" which is from the PQconnectStartParams docs, so hopefully people will be able to make that connection. I wrote documentation for "forRead" and "forWrite" as well. I had a question about parameter naming. Right now I have a mix of camel-case and snake-case in the function signature since that is what I inherited. Should I change that to be consistent? If so, which case would you like? Thanks for your continued reviews. -- Tristan Partin Neon (https://neon.tech)
Attachment
On Fri, Mar 22, 2024 at 4:58 PM Tristan Partin <tristan@neon.tech> wrote: > I had a question about parameter naming. Right now I have a mix of > camel-case and snake-case in the function signature since that is what > I inherited. Should I change that to be consistent? If so, which case > would you like? Uh... PostgreSQL is kind of the wild west in that regard. The thing to do is look for nearby precedents, but that doesn't help much here because in the very same file, libpq-fe.h, we have: extern int PQsetResultAttrs(PGresult *res, int numAttributes, PGresAttDesc *attDescs); extern int PQsetvalue(PGresult *res, int tup_num, int field_num, char *value, int len); Since the existing naming is consistent with one of those two styles, I'd probably just leave it be. + The function returns a value greater than <literal>0</literal> if the specified condition + is met, <literal>0</literal> if a timeout occurred, or <literal>-1</literal> if an error + or interrupt occurred. In the event <literal>forRead</literal> and We either need to tell people how to find out which error it was, or if that's not possible and we can't reasonably make it possible, we need to tell them why they shouldn't care. Because there's nothing more delightful than someone who shows up and says "hey, I tried to do XYZ, and I got an error," as if that were sufficient information for me to do something useful. + <literal>end_time</literal> is the time in the future in seconds starting from the UNIX + epoch in which you would like the function to return if the condition is not met. This sentence seems a bit contorted to me, like maybe Yoda wrote it. I was about to try to rephrase it and maybe split it in two when I wondered why we need to document how time_t works at all. Can't we just say something like "If end_time is not -1, it specifies the time at which this function should stop waiting for the condition to be met" -- and maybe move it to the end of the first paragraph, so it's before where we list the meanings of the return values? -- Robert Haas EDB: http://www.enterprisedb.com
On Mon Mar 25, 2024 at 1:44 PM CDT, Robert Haas wrote: > On Fri, Mar 22, 2024 at 4:58 PM Tristan Partin <tristan@neon.tech> wrote: > > I had a question about parameter naming. Right now I have a mix of > > camel-case and snake-case in the function signature since that is what > > I inherited. Should I change that to be consistent? If so, which case > > would you like? > > Uh... PostgreSQL is kind of the wild west in that regard. The thing to > do is look for nearby precedents, but that doesn't help much here > because in the very same file, libpq-fe.h, we have: > > extern int PQsetResultAttrs(PGresult *res, int numAttributes, > PGresAttDesc *attDescs); > extern int PQsetvalue(PGresult *res, int tup_num, int field_num, > char *value, int len); > > Since the existing naming is consistent with one of those two styles, > I'd probably just leave it be. > > + The function returns a value greater than <literal>0</literal> > if the specified condition > + is met, <literal>0</literal> if a timeout occurred, or > <literal>-1</literal> if an error > + or interrupt occurred. In the event <literal>forRead</literal> and > > We either need to tell people how to find out which error it was, or > if that's not possible and we can't reasonably make it possible, we > need to tell them why they shouldn't care. Because there's nothing > more delightful than someone who shows up and says "hey, I tried to do > XYZ, and I got an error," as if that were sufficient information for > me to do something useful. > > + <literal>end_time</literal> is the time in the future in > seconds starting from the UNIX > + epoch in which you would like the function to return if the > condition is not met. > > This sentence seems a bit contorted to me, like maybe Yoda wrote it. I > was about to try to rephrase it and maybe split it in two when I > wondered why we need to document how time_t works at all. Can't we > just say something like "If end_time is not -1, it specifies the time > at which this function should stop waiting for the condition to be > met" -- and maybe move it to the end of the first paragraph, so it's > before where we list the meanings of the return values? Incorporated feedback, I have :). -- Tristan Partin Neon (https://neon.tech)
Attachment
On Mon, Apr 1, 2024 at 12:04 PM Tristan Partin <tristan@neon.tech> wrote: > > This sentence seems a bit contorted to me, like maybe Yoda wrote it. > > Incorporated feedback, I have :). Committed it, I did. My thanks for working on this issue, I extend. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue Apr 2, 2024 at 9:32 AM CDT, Robert Haas wrote: > On Mon, Apr 1, 2024 at 12:04 PM Tristan Partin <tristan@neon.tech> wrote: > > > This sentence seems a bit contorted to me, like maybe Yoda wrote it. > > > > Incorporated feedback, I have :). > > Committed it, I did. My thanks for working on this issue, I extend. Thanks to everyone for their reviews! Patch is in a much better place than when it started. -- Tristan Partin Neon (https://neon.tech)
On Tue, 2 Apr 2024 at 16:33, Robert Haas <robertmhaas@gmail.com> wrote: > Committed it, I did. My thanks for working on this issue, I extend. Looking at the committed version of this patch, the pg_unreachable calls seemed weird to me. 1 is actually incorrect, thus possibly resulting in undefined behaviour. And for the other call an imho better fix would be to remove the now 21 year unused enum variant, instead of introducing its only reference in the whole codebase. Attached are two trivial patches, feel free to remove both of the pg_unreachable calls.
Attachment
On Wed Apr 3, 2024 at 8:32 AM CDT, Jelte Fennema-Nio wrote: > On Tue, 2 Apr 2024 at 16:33, Robert Haas <robertmhaas@gmail.com> wrote: > > Committed it, I did. My thanks for working on this issue, I extend. > > Looking at the committed version of this patch, the pg_unreachable > calls seemed weird to me. 1 is actually incorrect, thus possibly > resulting in undefined behaviour. And for the other call an imho > better fix would be to remove the now 21 year unused enum variant, > instead of introducing its only reference in the whole codebase. > > Attached are two trivial patches, feel free to remove both of the > pg_unreachable calls. Patches look good. Sorry about causing you to do some work. -- Tristan Partin Neon (https://neon.tech)
Jelte Fennema-Nio <postgres@jeltef.nl> writes: > Looking at the committed version of this patch, the pg_unreachable > calls seemed weird to me. 1 is actually incorrect, thus possibly > resulting in undefined behaviour. And for the other call an imho > better fix would be to remove the now 21 year unused enum variant, > instead of introducing its only reference in the whole codebase. If we do the latter, we will almost certainly get pushback from distros who check for library ABI breaks. I fear the comment suggesting that we could remove it someday is too optimistic. regards, tom lane
On Wed, 3 Apr 2024 at 16:31, Tom Lane <tgl@sss.pgh.pa.us> wrote: > If we do the latter, we will almost certainly get pushback from > distros who check for library ABI breaks. I fear the comment > suggesting that we could remove it someday is too optimistic. Alright, changed patch 0002 to keep the variant. But remove it from the recently added switch/case. And also updated the comment to remove the "for awhile".
Attachment
On Wed Apr 3, 2024 at 9:50 AM CDT, Jelte Fennema-Nio wrote: > On Wed, 3 Apr 2024 at 16:31, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > If we do the latter, we will almost certainly get pushback from > > distros who check for library ABI breaks. I fear the comment > > suggesting that we could remove it someday is too optimistic. > > Alright, changed patch 0002 to keep the variant. But remove it from > the recently added switch/case. And also updated the comment to remove > the "for awhile". Removing from the switch statement causes a warning: > [1/2] Compiling C object src/bin/psql/psql.p/command.c.o > ../src/bin/psql/command.c: In function ‘wait_until_connected’: > ../src/bin/psql/command.c:3803:17: warning: enumeration value ‘PGRES_POLLING_ACTIVE’ not handled in switch [-Wswitch] > 3803 | switch (PQconnectPoll(conn)) > | ^~~~~~ -- Tristan Partin Neon (https://neon.tech)
On Wed, 3 Apr 2024 at 16:55, Tristan Partin <tristan@neon.tech> wrote: > Removing from the switch statement causes a warning: > > > [1/2] Compiling C object src/bin/psql/psql.p/command.c.o > > ../src/bin/psql/command.c: In function ‘wait_until_connected’: > > ../src/bin/psql/command.c:3803:17: warning: enumeration value ‘PGRES_POLLING_ACTIVE’ not handled in switch [-Wswitch] > > 3803 | switch (PQconnectPoll(conn)) > > | ^~~~~~ Ofcourse... fixed now
Attachment
On Wed Apr 3, 2024 at 10:05 AM CDT, Jelte Fennema-Nio wrote: > On Wed, 3 Apr 2024 at 16:55, Tristan Partin <tristan@neon.tech> wrote: > > Removing from the switch statement causes a warning: > > > > > [1/2] Compiling C object src/bin/psql/psql.p/command.c.o > > > ../src/bin/psql/command.c: In function ‘wait_until_connected’: > > > ../src/bin/psql/command.c:3803:17: warning: enumeration value ‘PGRES_POLLING_ACTIVE’ not handled in switch [-Wswitch] > > > 3803 | switch (PQconnectPoll(conn)) > > > | ^~~~~~ > > > Ofcourse... fixed now I think patch 2 makes it worse. The value in -Wswitch is that when new enum variants are added, the developer knows the locations to update. Adding a default case makes -Wswitch pointless. Patch 1 is still good. The comment change in patch 2 is good too! -- Tristan Partin Neon (https://neon.tech)
On Wed, Apr 3, 2024 at 11:17 AM Tristan Partin <tristan@neon.tech> wrote: > I think patch 2 makes it worse. The value in -Wswitch is that when new > enum variants are added, the developer knows the locations to update. > Adding a default case makes -Wswitch pointless. > > Patch 1 is still good. The comment change in patch 2 is good too! It seems to me that 0001 should either remove the pg_unreachable() call or change the break to a return, but not both. The commit message tries to justify doing both by saying that the pg_unreachable() call doesn't have much value, but there's not much value in omitting pg_unreachable() from unreachable places, either, so I'm not convinced. I agree with Tristan's analysis of 0002. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, 3 Apr 2024 at 21:49, Robert Haas <robertmhaas@gmail.com> wrote: > It seems to me that 0001 should either remove the pg_unreachable() > call or change the break to a return, but not both. Updated the patch to only remove the pg_unreachable call (and keep the breaks). > but there's not much value in omitting > pg_unreachable() from unreachable places, either, so I'm not > convinced. But I don't agree with this. Having pg_unreachable in places where it brings no perf benefit has two main downsides: 1. It's extra lines of code 2. If a programmer/reviewer is not careful about maintaining this unreachable invariant (now or in the future), undefined behavior can be introduced quite easily. Also, I'd expect any optimizing compiler to know that the code after the loop is already unreachable if there are no break/goto statements in its body. > I agree with Tristan's analysis of 0002. Updated 0002, to only change the comment. But honestly I don't think using "default" really introduces many chances for future bugs in this case, since it seems very unlikely we'll ever add new variants to the PostgresPollingStatusType enum.
Attachment
On Thu, Apr 4, 2024 at 6:43 AM Jelte Fennema-Nio <postgres@jeltef.nl> wrote: > But I don't agree with this. Having pg_unreachable in places where it > brings no perf benefit has two main downsides: > > 1. It's extra lines of code > 2. If a programmer/reviewer is not careful about maintaining this > unreachable invariant (now or in the future), undefined behavior can > be introduced quite easily. > > Also, I'd expect any optimizing compiler to know that the code after > the loop is already unreachable if there are no break/goto statements > in its body. I'm not super-well-informed about the effects of pg_unreachable(), but I feel like it's a little strange to complain that it's an extra line of code. Presumably, it translates into no executable instructions, and might even reduce the size of the generated code. Now, it could still be a cognitive burden on human readers, but hopefully it should do the exact opposite: it should make the code easier to understand by documenting that the author thought that a certain point in the code could not be reached. In that sense, it's like a code comment. Likewise, it certainly is true that one must be careful to put pg_unreachable() only in places that aren't reachable, and to add or remove it as appropriate if the set of unreachable places changes. But it's also true that one must be careful to put any other thing that looks like a function call only in places where that thing is appropriate. > > I agree with Tristan's analysis of 0002. > > Updated 0002, to only change the comment. But honestly I don't think > using "default" really introduces many chances for future bugs in this > case, since it seems very unlikely we'll ever add new variants to the > PostgresPollingStatusType enum. That might be true, but we've avoided using default: for switches over lots of other enum types in lots of other places in PostgreSQL, and overall it's saved us from lots of bugs. I'm not a stickler about this; if there's some reason not to do it in some particular case, that's fine with me. But where it's possible to do it without any real disadvantage, I think it's a healthy practice. I have committed these versions of the patches. -- Robert Haas EDB: http://www.enterprisedb.com
Thanks Jelte and Robert for the extra effort to correct my mistake. I apologize. Bad copy-paste from a previous revision of the patch at some point. -- Tristan Partin Neon (https://neon.tech)