Thread: pgsql: Add tests for '-f' option in dropdb utility.

pgsql: Add tests for '-f' option in dropdb utility.

From

Amit Kapila

Date:

28 November 2019, 09:18:55

Add tests for '-f' option in dropdb utility.

This will test that the force option works when there is an active backend
connected to the database being dropped.

Author: Pavel Stehule and Vignesh C
Reviewed-by: Amit Kapila and Vignesh C
Discussion: https://postgr.es/m/CAP_rwwmLJJbn70vLOZFpxGw3XD7nLB_7+NKz46H5EOO2k5H7OQ@mail.gmail.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/8a7e9e9dad56419ff987e5f6baaf411a03c1951a

Modified Files
--------------
src/bin/scripts/t/050_dropdb.pl       |   8 +--
src/bin/scripts/t/051_dropdb_force.pl | 104 ++++++++++++++++++++++++++++++++++
2 files changed, 105 insertions(+), 7 deletions(-)

Re: pgsql: Add tests for '-f' option in dropdb utility.

From

Amit Kapila

Date:

28 November 2019, 16:29:48

On Thu, Nov 28, 2019 at 11:49 AM Amit Kapila <akapila@postgresql.org> wrote:
>
> Add tests for '-f' option in dropdb utility.
>

drongo doesn't seem to like this commit [1].  Below is shown in failure report:

# aborting wait: program died
# stream contents: >>psql:<stdin>:4: server closed the connection unexpectedly
# This probably means the server terminated abnormally
# before or while processing the request.
# psql:<stdin>:4: fatal: connection to server was lost
# <<
# pattern searched for: (?^m:FATAL:  terminating connection due to
administrator command)

It seems to indicate that the message is not matched after we have
forcefully terminated it by Drop Database .. (force);.  The first
thought that came to mind is that it is specific to this system as
this passes on other machines.  It is possible that it is
Windows-specific.  I'll investigate it further.

[1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2019-11-28%2009%3A57%3A30

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: pgsql: Add tests for '-f' option in dropdb utility.

From

Amit Kapila

Date:

28 November 2019, 16:49:48

On Thu, Nov 28, 2019 at 6:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Nov 28, 2019 at 11:49 AM Amit Kapila <akapila@postgresql.org> wrote:
> >
> > Add tests for '-f' option in dropdb utility.
> >
>
> drongo doesn't seem to like this commit [1].  Below is shown in failure report:
>
> # aborting wait: program died
> # stream contents: >>psql:<stdin>:4: server closed the connection unexpectedly
> # This probably means the server terminated abnormally
> # before or while processing the request.
> # psql:<stdin>:4: fatal: connection to server was lost
> # <<
> # pattern searched for: (?^m:FATAL:  terminating connection due to
> administrator command)
>

I checked that one other similar usage in 013_crash_restart.pl is
expecting below message:
/WARNING:  terminating connection because of crash of another server
process|server closed the connection unexpectedly|connection to server
was lost

The message expected by this test was: "FATAL:  terminating connection
due to administrator command" which seems to be fine in most cases but
not all.

I think we just need to change the expected message the same as in
013_crash_restart.pl to make this test pass on all platforms.  I will
do a bit more study and prepare a test setup in a similar environment
to test this case tomorrow.  I don't want to do this today in a hurry
without testing it carefully.  I hope that is fine.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: pgsql: Add tests for '-f' option in dropdb utility.

From

Tom Lane

Date:

28 November 2019, 17:46:09

Amit Kapila <amit.kapila16@gmail.com> writes:
> The message expected by this test was: "FATAL:  terminating connection
> due to administrator command" which seems to be fine in most cases but
> not all.

I think the correct question to ask is "why not all cases"?

> I think we just need to change the expected message the same as in
> 013_crash_restart.pl to make this test pass on all platforms.

If DROP FORCE sometimes gives victim sessions the impression that the
server crashed, rather than that they were kicked off intentionally,
I think that's totally unacceptable.  "Fixing" this by changing the
test to allow it is not the answer, and if we can't find a better
answer, I'm going to vote to revert the feature.

            regards, tom lane

Re: pgsql: Add tests for '-f' option in dropdb utility.

From

Amit Kapila

Date:

28 November 2019, 18:43:19

On Thu, Nov 28, 2019 at 8:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > The message expected by this test was: "FATAL:  terminating connection
> > due to administrator command" which seems to be fine in most cases but
> > not all.
>
> I think the correct question to ask is "why not all cases"?
>

As of now, it seems to me that this happens only on Windows.  I am not
sure why so?  I will investigate this further and share my findings.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: pgsql: Add tests for '-f' option in dropdb utility.

From

Tom Lane

Date:

28 November 2019, 23:55:21

Amit Kapila <amit.kapila16@gmail.com> writes:
> On Thu, Nov 28, 2019 at 8:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think the correct question to ask is "why not all cases"?

> As of now, it seems to me that this happens only on Windows. I am not
> sure why so? I will investigate this further and share my findings.

A couple of interesting things stand out from looking at the buildfarm
failures:

* On some of the machines, it seems like "chomp" is failing to get
rid of all the trailing whitespace in $pid:

ok 4 - acquired pid for SIGTERM
not ok 5 - database foobar1 is used

# Failed test 'database foobar1 is used'
# at t/051_dropdb_force.pl line 71.
# got: '212024'
# expected: '212024
'

How can that be? It somewhat-accidentally doesn't seem to be
causing any additional problems, but still we need this test
step to work (or else remove it, it's not really essential).

* On all the failing machines, it's very clear from the postmaster
log that the backend knows why it's being terminated:

2019-11-28 13:47:56.320 UTC [212024:4] 051_dropdb_force.pl FATAL: terminating connection due to administrator command

So the question seems to be why libpq isn't reporting that
message before it detects connection-closed.

This triggered a vague memory, and after a bit of archives-digging,
I found this thread from a few months back:

https://www.postgresql.org/message-id/flat/CA%2BhUKGJowQypXSKsjws9A%2BnEQDD0-mExHZqFXtJ09N209rCO5A%40mail.gmail.com#0629f079bc59ecdaa0d6ac9f8f2c18ac

in which it's alleged that Windows' TCP stack is flat-out
broken and will drop not-yet-delivered data when the server
closes the connection.

If that's true, it's pretty nasty. Windows is about the
last platform where I'd want us to have behavior like this,
because we *will* get bug reports about it from novices.

If there's no other workaround, I'm tempted to propose

#ifdef WIN32
pg_sleep(1 second);
#endif

or something close to that, before we close the socket.

Or we could revert the whole feature. I'm not sure it's
worth fighting portability wars for.

regards, tom lane

Re: pgsql: Add tests for '-f' option in dropdb utility.

From

Amit Kapila

Date:

29 November 2019, 05:12:23

On Fri, Nov 29, 2019 at 2:25 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > On Thu, Nov 28, 2019 at 8:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I think the correct question to ask is "why not all cases"?
>
> > As of now, it seems to me that this happens only on Windows.  I am not
> > sure why so?  I will investigate this further and share my findings.
>
> A couple of interesting things stand out from looking at the buildfarm
> failures:
>
> * On some of the machines, it seems like "chomp" is failing to get
> rid of all the trailing whitespace in $pid:
>
> ok 4 - acquired pid for SIGTERM
> not ok 5 - database foobar1 is used
>
> #   Failed test 'database foobar1 is used'
> #   at t/051_dropdb_force.pl line 71.
> #          got: '212024'
> #     expected: '212024
> '
>
> How can that be?  It somewhat-accidentally doesn't seem to be
> causing any additional problems, but still we need this test
> step to work (or else remove it, it's not really essential).
>

Yeah, we need to do something about this, if nothing works, we can
remove this step from the test, but let us first decide what to do
about the next point.

> * On all the failing machines, it's very clear from the postmaster
> log that the backend knows why it's being terminated:
>
> 2019-11-28 13:47:56.320 UTC [212024:4] 051_dropdb_force.pl FATAL:  terminating connection due to administrator
command
>

Yeah, I can confirm this behavior on the Windows machines.  I have
also seen that we already expose this behavior in more than one way
and all behave similar to this.  If I use pg_terminate_backend(<pid>)
or pg_ctl kill TERM <pid>, the behavior is exactly the same
(terminated backend doesn't get the message (FATAL:  terminating
connection due to administrator command), but it is present in
postmaster log.

> So the question seems to be why libpq isn't reporting that
> message before it detects connection-closed.
>
> This triggered a vague memory, and after a bit of archives-digging,
> I found this thread from a few months back:
>
>
https://www.postgresql.org/message-id/flat/CA%2BhUKGJowQypXSKsjws9A%2BnEQDD0-mExHZqFXtJ09N209rCO5A%40mail.gmail.com#0629f079bc59ecdaa0d6ac9f8f2c18ac
>
> in which it's alleged that Windows' TCP stack is flat-out
> broken and will drop not-yet-delivered data when the server
> closes the connection.
>
> If that's true, it's pretty nasty.  Windows is about the
> last platform where I'd want us to have behavior like this,
> because we *will* get bug reports about it from novices.
>
> If there's no other workaround, I'm tempted to propose
>
> #ifdef WIN32
>         pg_sleep(1 second);
> #endif
>
> or something close to that, before we close the socket.
>

I can experiment with this or if something else occurred to me.

> Or we could revert the whole feature.
>

Yeah, that is also one possibility, but I think given we already have
this behavior in existing features, it is better to either come up
with some solution or maybe mention in docs that in such cases users
need to check postmaster log to know the actual reason.

I think we can further explore this, but for now, we might want to (a)
revert this test, or (b) change the expected output to match.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: pgsql: Add tests for '-f' option in dropdb utility.

From

Amit Kapila

Date:

29 November 2019, 12:54:12

On Fri, Nov 29, 2019 at 7:42 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Nov 29, 2019 at 2:25 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
>
> > * On all the failing machines, it's very clear from the postmaster
> > log that the backend knows why it's being terminated:
> >
> > 2019-11-28 13:47:56.320 UTC [212024:4] 051_dropdb_force.pl FATAL:  terminating connection due to administrator
command
> >
>
> Yeah, I can confirm this behavior on the Windows machines.  I have
> also seen that we already expose this behavior in more than one way
> and all behave similar to this.  If I use pg_terminate_backend(<pid>)
> or pg_ctl kill TERM <pid>, the behavior is exactly the same
> (terminated backend doesn't get the message (FATAL:  terminating
> connection due to administrator command), but it is present in
> postmaster log.
>
> > So the question seems to be why libpq isn't reporting that
> > message before it detects connection-closed.
> >
> > This triggered a vague memory, and after a bit of archives-digging,
> > I found this thread from a few months back:
> >
> >
https://www.postgresql.org/message-id/flat/CA%2BhUKGJowQypXSKsjws9A%2BnEQDD0-mExHZqFXtJ09N209rCO5A%40mail.gmail.com#0629f079bc59ecdaa0d6ac9f8f2c18ac
> >
> > in which it's alleged that Windows' TCP stack is flat-out
> > broken and will drop not-yet-delivered data when the server
> > closes the connection.
> >
> > If that's true, it's pretty nasty.  Windows is about the
> > last platform where I'd want us to have behavior like this,
> > because we *will* get bug reports about it from novices.
> >
> > If there's no other workaround, I'm tempted to propose
> >
> > #ifdef WIN32
> >         pg_sleep(1 second);
> > #endif
> >
> > or something close to that, before we close the socket.
> >
>
> I can experiment with this or if something else occurred to me.
>

The above experiment suggested by you works and the test passes after
this in the local Windows setup.   One more thing, I think we don't
need to do above for graceful exits and during socket_close we have
access to exit_status_code which can be used in this context.  I have
attached the patch for this.  I think something on these lines is even
suggested by MSDN [1], but I am not sure.

I have also tried calling closesocket() explicitly in our function
socket_close which has changed the error message to "could not receive
data from server: Software caused connection abort
(0x00002745/10053)".  I am not sure if it is any better.  Another
thing I have tried to use is SO_LINGER option of socket but that
didn't work out.

> > Or we could revert the whole feature.
> >
>
> Yeah, that is also one possibility, but I think given we already have
> this behavior in existing features, it is better to either come up
> with some solution or maybe mention in docs that in such cases users
> need to check postmaster log to know the actual reason.
>
> I think we can further explore this, but for now, we might want to (a)
> revert this test, or (b) change the expected output to match.
>

I think if we decide to go with the above solution for Windows, then
we can keep the current test as it is or probably remove one test
related to <pid> test, otherwise, it is better to go with (a) for now
and once we decide any solution for this we can again commit these
tests.

[1] - https://docs.microsoft.com/en-us/windows/win32/api/winsock/nf-winsock-closesocket
See below note in the above link:
Using the closesocket or shutdown functions with SD_SEND or SD_BOTH
results in a RELEASE signal being sent out on the control channel. Due
to ATM's use of separate signal and data channels, it is possible that
a RELEASE signal could reach the remote end before the last of the
data reaches its destination, resulting in a loss of that data. One
possible solutions is programming a sufficient delay between the last
data sent and the closesocket or shutdown function calls for an ATM
socket.
Half close is not supported by ATM.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

sleep_before_socket_close_1.patch

Re: pgsql: Add tests for '-f' option in dropdb utility.

From

Amit Kapila

Date:

29 November 2019, 16:42:31

On Fri, Nov 29, 2019 at 3:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Nov 29, 2019 at 7:42 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Nov 29, 2019 at 2:25 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >
> >
> > > * On all the failing machines, it's very clear from the postmaster
> > > log that the backend knows why it's being terminated:
> > >
> > > 2019-11-28 13:47:56.320 UTC [212024:4] 051_dropdb_force.pl FATAL:  terminating connection due to administrator
command
> > >
> >
> > Yeah, I can confirm this behavior on the Windows machines.  I have
> > also seen that we already expose this behavior in more than one way
> > and all behave similar to this.  If I use pg_terminate_backend(<pid>)
> > or pg_ctl kill TERM <pid>, the behavior is exactly the same
> > (terminated backend doesn't get the message (FATAL:  terminating
> > connection due to administrator command), but it is present in
> > postmaster log.
> >
> > > So the question seems to be why libpq isn't reporting that
> > > message before it detects connection-closed.
> > >
> > > This triggered a vague memory, and after a bit of archives-digging,
> > > I found this thread from a few months back:
> > >
> > >
https://www.postgresql.org/message-id/flat/CA%2BhUKGJowQypXSKsjws9A%2BnEQDD0-mExHZqFXtJ09N209rCO5A%40mail.gmail.com#0629f079bc59ecdaa0d6ac9f8f2c18ac
> > >
> > > in which it's alleged that Windows' TCP stack is flat-out
> > > broken and will drop not-yet-delivered data when the server
> > > closes the connection.
> > >
> > > If that's true, it's pretty nasty.  Windows is about the
> > > last platform where I'd want us to have behavior like this,
> > > because we *will* get bug reports about it from novices.
> > >
> > > If there's no other workaround, I'm tempted to propose
> > >
> > > #ifdef WIN32
> > >         pg_sleep(1 second);
> > > #endif
> > >
> > > or something close to that, before we close the socket.
> > >
> >
> > I can experiment with this or if something else occurred to me.
> >
>
> The above experiment suggested by you works and the test passes after
> this in the local Windows setup.   One more thing, I think we don't
> need to do above for graceful exits and during socket_close we have
> access to exit_status_code which can be used in this context.  I have
> attached the patch for this.  I think something on these lines is even
> suggested by MSDN [1], but I am not sure.
>

On further reading the Microsoft docs, it seems that even though it
suggests something which can work in our current situation, but it is
in a different context.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: pgsql: Add tests for '-f' option in dropdb utility.

From

Amit Kapila

Date:

29 November 2019, 19:40:50

On Fri, Nov 29, 2019 at 3:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Nov 29, 2019 at 7:42 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Nov 29, 2019 at 2:25 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >
> > > If there's no other workaround, I'm tempted to propose
> > >
> > > #ifdef WIN32
> > >         pg_sleep(1 second);
> > > #endif
> > >
> > > or something close to that, before we close the socket.
> > >
> >
> > I can experiment with this or if something else occurred to me.
> >
>
> The above experiment suggested by you works and the test passes after
> this in the local Windows setup.   One more thing, I think we don't
> need to do above for graceful exits and during socket_close we have
> access to exit_status_code which can be used in this context.  I have
> attached the patch for this.  I think something on these lines is even
> suggested by MSDN [1], but I am not sure.
>
..
>
> > > Or we could revert the whole feature.
> > >
> >
> > Yeah, that is also one possibility, but I think given we already have
> > this behavior in existing features, it is better to either come up
> > with some solution or maybe mention in docs that in such cases users
> > need to check postmaster log to know the actual reason.
> >
> > I think we can further explore this, but for now, we might want to (a)
> > revert this test, or (b) change the expected output to match.
> >
>
> I think if we decide to go with the above solution for Windows, then
> we can keep the current test as it is or probably remove one test
> related to <pid> test, otherwise, it is better to go with (a) for now
> and once we decide any solution for this we can again commit these
> tests.
>

I am planning to revert the commits (8a7e9e9dad and 290acac92b) done
for tests tomorrow morning and then separately explore the solution
for the Windows-specific problem (it drops not-yet-delivered data when
the server closes the connection) as it seems to impact multiple
features.  It is quite possible that the best we can do is on the
lines of what you have suggested (sleep for short duration before
closing the socket), but there is no harm in spending some more time
and see if we can come up with something better.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com