Thread: pgsql: Add tests for '-f' option in dropdb utility.
Add tests for '-f' option in dropdb utility. This will test that the force option works when there is an active backend connected to the database being dropped. Author: Pavel Stehule and Vignesh C Reviewed-by: Amit Kapila and Vignesh C Discussion: https://postgr.es/m/CAP_rwwmLJJbn70vLOZFpxGw3XD7nLB_7+NKz46H5EOO2k5H7OQ@mail.gmail.com Branch ------ master Details ------- https://git.postgresql.org/pg/commitdiff/8a7e9e9dad56419ff987e5f6baaf411a03c1951a Modified Files -------------- src/bin/scripts/t/050_dropdb.pl | 8 +-- src/bin/scripts/t/051_dropdb_force.pl | 104 ++++++++++++++++++++++++++++++++++ 2 files changed, 105 insertions(+), 7 deletions(-)
On Thu, Nov 28, 2019 at 11:49 AM Amit Kapila <akapila@postgresql.org> wrote: > > Add tests for '-f' option in dropdb utility. > drongo doesn't seem to like this commit [1]. Below is shown in failure report: # aborting wait: program died # stream contents: >>psql:<stdin>:4: server closed the connection unexpectedly # This probably means the server terminated abnormally # before or while processing the request. # psql:<stdin>:4: fatal: connection to server was lost # << # pattern searched for: (?^m:FATAL: terminating connection due to administrator command) It seems to indicate that the message is not matched after we have forcefully terminated it by Drop Database .. (force);. The first thought that came to mind is that it is specific to this system as this passes on other machines. It is possible that it is Windows-specific. I'll investigate it further. [1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2019-11-28%2009%3A57%3A30 -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Nov 28, 2019 at 6:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Nov 28, 2019 at 11:49 AM Amit Kapila <akapila@postgresql.org> wrote: > > > > Add tests for '-f' option in dropdb utility. > > > > drongo doesn't seem to like this commit [1]. Below is shown in failure report: > > # aborting wait: program died > # stream contents: >>psql:<stdin>:4: server closed the connection unexpectedly > # This probably means the server terminated abnormally > # before or while processing the request. > # psql:<stdin>:4: fatal: connection to server was lost > # << > # pattern searched for: (?^m:FATAL: terminating connection due to > administrator command) > I checked that one other similar usage in 013_crash_restart.pl is expecting below message: /WARNING: terminating connection because of crash of another server process|server closed the connection unexpectedly|connection to server was lost The message expected by this test was: "FATAL: terminating connection due to administrator command" which seems to be fine in most cases but not all. I think we just need to change the expected message the same as in 013_crash_restart.pl to make this test pass on all platforms. I will do a bit more study and prepare a test setup in a similar environment to test this case tomorrow. I don't want to do this today in a hurry without testing it carefully. I hope that is fine. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Amit Kapila <amit.kapila16@gmail.com> writes: > The message expected by this test was: "FATAL: terminating connection > due to administrator command" which seems to be fine in most cases but > not all. I think the correct question to ask is "why not all cases"? > I think we just need to change the expected message the same as in > 013_crash_restart.pl to make this test pass on all platforms. If DROP FORCE sometimes gives victim sessions the impression that the server crashed, rather than that they were kicked off intentionally, I think that's totally unacceptable. "Fixing" this by changing the test to allow it is not the answer, and if we can't find a better answer, I'm going to vote to revert the feature. regards, tom lane
On Thu, Nov 28, 2019 at 8:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Amit Kapila <amit.kapila16@gmail.com> writes: > > The message expected by this test was: "FATAL: terminating connection > > due to administrator command" which seems to be fine in most cases but > > not all. > > I think the correct question to ask is "why not all cases"? > As of now, it seems to me that this happens only on Windows. I am not sure why so? I will investigate this further and share my findings. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Amit Kapila <amit.kapila16@gmail.com> writes: > On Thu, Nov 28, 2019 at 8:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I think the correct question to ask is "why not all cases"? > As of now, it seems to me that this happens only on Windows. I am not > sure why so? I will investigate this further and share my findings. A couple of interesting things stand out from looking at the buildfarm failures: * On some of the machines, it seems like "chomp" is failing to get rid of all the trailing whitespace in $pid: ok 4 - acquired pid for SIGTERM not ok 5 - database foobar1 is used # Failed test 'database foobar1 is used' # at t/051_dropdb_force.pl line 71. # got: '212024' # expected: '212024 ' How can that be? It somewhat-accidentally doesn't seem to be causing any additional problems, but still we need this test step to work (or else remove it, it's not really essential). * On all the failing machines, it's very clear from the postmaster log that the backend knows why it's being terminated: 2019-11-28 13:47:56.320 UTC [212024:4] 051_dropdb_force.pl FATAL: terminating connection due to administrator command So the question seems to be why libpq isn't reporting that message before it detects connection-closed. This triggered a vague memory, and after a bit of archives-digging, I found this thread from a few months back: https://www.postgresql.org/message-id/flat/CA%2BhUKGJowQypXSKsjws9A%2BnEQDD0-mExHZqFXtJ09N209rCO5A%40mail.gmail.com#0629f079bc59ecdaa0d6ac9f8f2c18ac in which it's alleged that Windows' TCP stack is flat-out broken and will drop not-yet-delivered data when the server closes the connection. If that's true, it's pretty nasty. Windows is about the last platform where I'd want us to have behavior like this, because we *will* get bug reports about it from novices. If there's no other workaround, I'm tempted to propose #ifdef WIN32 pg_sleep(1 second); #endif or something close to that, before we close the socket. Or we could revert the whole feature. I'm not sure it's worth fighting portability wars for. regards, tom lane
On Fri, Nov 29, 2019 at 2:25 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Amit Kapila <amit.kapila16@gmail.com> writes: > > On Thu, Nov 28, 2019 at 8:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> I think the correct question to ask is "why not all cases"? > > > As of now, it seems to me that this happens only on Windows. I am not > > sure why so? I will investigate this further and share my findings. > > A couple of interesting things stand out from looking at the buildfarm > failures: > > * On some of the machines, it seems like "chomp" is failing to get > rid of all the trailing whitespace in $pid: > > ok 4 - acquired pid for SIGTERM > not ok 5 - database foobar1 is used > > # Failed test 'database foobar1 is used' > # at t/051_dropdb_force.pl line 71. > # got: '212024' > # expected: '212024 > ' > > How can that be? It somewhat-accidentally doesn't seem to be > causing any additional problems, but still we need this test > step to work (or else remove it, it's not really essential). > Yeah, we need to do something about this, if nothing works, we can remove this step from the test, but let us first decide what to do about the next point. > * On all the failing machines, it's very clear from the postmaster > log that the backend knows why it's being terminated: > > 2019-11-28 13:47:56.320 UTC [212024:4] 051_dropdb_force.pl FATAL: terminating connection due to administrator command > Yeah, I can confirm this behavior on the Windows machines. I have also seen that we already expose this behavior in more than one way and all behave similar to this. If I use pg_terminate_backend(<pid>) or pg_ctl kill TERM <pid>, the behavior is exactly the same (terminated backend doesn't get the message (FATAL: terminating connection due to administrator command), but it is present in postmaster log. > So the question seems to be why libpq isn't reporting that > message before it detects connection-closed. > > This triggered a vague memory, and after a bit of archives-digging, > I found this thread from a few months back: > > https://www.postgresql.org/message-id/flat/CA%2BhUKGJowQypXSKsjws9A%2BnEQDD0-mExHZqFXtJ09N209rCO5A%40mail.gmail.com#0629f079bc59ecdaa0d6ac9f8f2c18ac > > in which it's alleged that Windows' TCP stack is flat-out > broken and will drop not-yet-delivered data when the server > closes the connection. > > If that's true, it's pretty nasty. Windows is about the > last platform where I'd want us to have behavior like this, > because we *will* get bug reports about it from novices. > > If there's no other workaround, I'm tempted to propose > > #ifdef WIN32 > pg_sleep(1 second); > #endif > > or something close to that, before we close the socket. > I can experiment with this or if something else occurred to me. > Or we could revert the whole feature. > Yeah, that is also one possibility, but I think given we already have this behavior in existing features, it is better to either come up with some solution or maybe mention in docs that in such cases users need to check postmaster log to know the actual reason. I think we can further explore this, but for now, we might want to (a) revert this test, or (b) change the expected output to match. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Nov 29, 2019 at 7:42 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Nov 29, 2019 at 2:25 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > > * On all the failing machines, it's very clear from the postmaster > > log that the backend knows why it's being terminated: > > > > 2019-11-28 13:47:56.320 UTC [212024:4] 051_dropdb_force.pl FATAL: terminating connection due to administrator command > > > > Yeah, I can confirm this behavior on the Windows machines. I have > also seen that we already expose this behavior in more than one way > and all behave similar to this. If I use pg_terminate_backend(<pid>) > or pg_ctl kill TERM <pid>, the behavior is exactly the same > (terminated backend doesn't get the message (FATAL: terminating > connection due to administrator command), but it is present in > postmaster log. > > > So the question seems to be why libpq isn't reporting that > > message before it detects connection-closed. > > > > This triggered a vague memory, and after a bit of archives-digging, > > I found this thread from a few months back: > > > > https://www.postgresql.org/message-id/flat/CA%2BhUKGJowQypXSKsjws9A%2BnEQDD0-mExHZqFXtJ09N209rCO5A%40mail.gmail.com#0629f079bc59ecdaa0d6ac9f8f2c18ac > > > > in which it's alleged that Windows' TCP stack is flat-out > > broken and will drop not-yet-delivered data when the server > > closes the connection. > > > > If that's true, it's pretty nasty. Windows is about the > > last platform where I'd want us to have behavior like this, > > because we *will* get bug reports about it from novices. > > > > If there's no other workaround, I'm tempted to propose > > > > #ifdef WIN32 > > pg_sleep(1 second); > > #endif > > > > or something close to that, before we close the socket. > > > > I can experiment with this or if something else occurred to me. > The above experiment suggested by you works and the test passes after this in the local Windows setup. One more thing, I think we don't need to do above for graceful exits and during socket_close we have access to exit_status_code which can be used in this context. I have attached the patch for this. I think something on these lines is even suggested by MSDN [1], but I am not sure. I have also tried calling closesocket() explicitly in our function socket_close which has changed the error message to "could not receive data from server: Software caused connection abort (0x00002745/10053)". I am not sure if it is any better. Another thing I have tried to use is SO_LINGER option of socket but that didn't work out. > > Or we could revert the whole feature. > > > > Yeah, that is also one possibility, but I think given we already have > this behavior in existing features, it is better to either come up > with some solution or maybe mention in docs that in such cases users > need to check postmaster log to know the actual reason. > > I think we can further explore this, but for now, we might want to (a) > revert this test, or (b) change the expected output to match. > I think if we decide to go with the above solution for Windows, then we can keep the current test as it is or probably remove one test related to <pid> test, otherwise, it is better to go with (a) for now and once we decide any solution for this we can again commit these tests. [1] - https://docs.microsoft.com/en-us/windows/win32/api/winsock/nf-winsock-closesocket See below note in the above link: Using the closesocket or shutdown functions with SD_SEND or SD_BOTH results in a RELEASE signal being sent out on the control channel. Due to ATM's use of separate signal and data channels, it is possible that a RELEASE signal could reach the remote end before the last of the data reaches its destination, resulting in a loss of that data. One possible solutions is programming a sufficient delay between the last data sent and the closesocket or shutdown function calls for an ATM socket. Half close is not supported by ATM. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Fri, Nov 29, 2019 at 3:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Nov 29, 2019 at 7:42 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Nov 29, 2019 at 2:25 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > > > > > * On all the failing machines, it's very clear from the postmaster > > > log that the backend knows why it's being terminated: > > > > > > 2019-11-28 13:47:56.320 UTC [212024:4] 051_dropdb_force.pl FATAL: terminating connection due to administrator command > > > > > > > Yeah, I can confirm this behavior on the Windows machines. I have > > also seen that we already expose this behavior in more than one way > > and all behave similar to this. If I use pg_terminate_backend(<pid>) > > or pg_ctl kill TERM <pid>, the behavior is exactly the same > > (terminated backend doesn't get the message (FATAL: terminating > > connection due to administrator command), but it is present in > > postmaster log. > > > > > So the question seems to be why libpq isn't reporting that > > > message before it detects connection-closed. > > > > > > This triggered a vague memory, and after a bit of archives-digging, > > > I found this thread from a few months back: > > > > > > https://www.postgresql.org/message-id/flat/CA%2BhUKGJowQypXSKsjws9A%2BnEQDD0-mExHZqFXtJ09N209rCO5A%40mail.gmail.com#0629f079bc59ecdaa0d6ac9f8f2c18ac > > > > > > in which it's alleged that Windows' TCP stack is flat-out > > > broken and will drop not-yet-delivered data when the server > > > closes the connection. > > > > > > If that's true, it's pretty nasty. Windows is about the > > > last platform where I'd want us to have behavior like this, > > > because we *will* get bug reports about it from novices. > > > > > > If there's no other workaround, I'm tempted to propose > > > > > > #ifdef WIN32 > > > pg_sleep(1 second); > > > #endif > > > > > > or something close to that, before we close the socket. > > > > > > > I can experiment with this or if something else occurred to me. > > > > The above experiment suggested by you works and the test passes after > this in the local Windows setup. One more thing, I think we don't > need to do above for graceful exits and during socket_close we have > access to exit_status_code which can be used in this context. I have > attached the patch for this. I think something on these lines is even > suggested by MSDN [1], but I am not sure. > On further reading the Microsoft docs, it seems that even though it suggests something which can work in our current situation, but it is in a different context. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Nov 29, 2019 at 3:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Nov 29, 2019 at 7:42 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Nov 29, 2019 at 2:25 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > > > If there's no other workaround, I'm tempted to propose > > > > > > #ifdef WIN32 > > > pg_sleep(1 second); > > > #endif > > > > > > or something close to that, before we close the socket. > > > > > > > I can experiment with this or if something else occurred to me. > > > > The above experiment suggested by you works and the test passes after > this in the local Windows setup. One more thing, I think we don't > need to do above for graceful exits and during socket_close we have > access to exit_status_code which can be used in this context. I have > attached the patch for this. I think something on these lines is even > suggested by MSDN [1], but I am not sure. > .. > > > > Or we could revert the whole feature. > > > > > > > Yeah, that is also one possibility, but I think given we already have > > this behavior in existing features, it is better to either come up > > with some solution or maybe mention in docs that in such cases users > > need to check postmaster log to know the actual reason. > > > > I think we can further explore this, but for now, we might want to (a) > > revert this test, or (b) change the expected output to match. > > > > I think if we decide to go with the above solution for Windows, then > we can keep the current test as it is or probably remove one test > related to <pid> test, otherwise, it is better to go with (a) for now > and once we decide any solution for this we can again commit these > tests. > I am planning to revert the commits (8a7e9e9dad and 290acac92b) done for tests tomorrow morning and then separately explore the solution for the Windows-specific problem (it drops not-yet-delivered data when the server closes the connection) as it seems to impact multiple features. It is quite possible that the best we can do is on the lines of what you have suggested (sleep for short duration before closing the socket), but there is no harm in spending some more time and see if we can come up with something better. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com