Thread: Simplify backend terminate and wait logic in postgres_fdw test
Hi,

With the recent commit aaf0432572, which introduced a waiting/timeout capability for the pg_terminate_backend function, I would like to do $subject. Attaching a patch, please have a look.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Apr 08, 2021 at 04:55:22PM +0530, Bharath Rupireddy wrote:
> With the recent commit aaf0432572, which introduced a waiting/timeout
> capability for the pg_terminate_backend function, I would like to do
> $subject. Attaching a patch, please have a look.

+-- Terminate the remote backend having the specified application_name and wait
+-- for the termination to complete. 10 seconds timeout here is chosen randomly,
+-- we will see a warning if the process doesn't go away within that time.
+SELECT pg_terminate_backend(pid, 10000) FROM pg_stat_activity
+	WHERE application_name = 'fdw_retry_check';

I think that you are making the tests less stable by doing that. A couple of buildfarm machines are very slow, and 10 seconds would not be enough. So it seems to me that this patch trades a stable solution for one that may end up biting at random.
--
Michael
On Thu, Apr 8, 2021 at 5:28 PM Michael Paquier <michael@paquier.xyz> wrote:
> I think that you are making the tests less stable by doing that. A
> couple of buildfarm machines are very slow, and 10 seconds would not
> be enough. So it seems to me that this patch trades a stable solution
> for one that may end up biting at random.

Agree. Please see the attached patch; I removed the fixed waiting time. Instead of relying on pg_stat_activity, pg_sleep and pg_stat_clear_snapshot, it now depends on pg_terminate_backend and pg_wait_for_backend_termination. This way we reduce the functions that the procedure terminate_backend_and_wait uses, and the new functions pg_terminate_backend and pg_wait_for_backend_termination get test coverage.

Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
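For context, a rough sketch of what the reworked terminate_backend_and_wait procedure could look like, reconstructed from the fragments quoted later in this thread; the procedure name comes from the message above, but the exact body lives in the attached v2 patch and may differ. Note that, as pointed out downthread, pg_terminate_backend without a timeout returns true as soon as the signal is sent, so the wait loop below is effectively skipped.

    -- illustrative reconstruction, not the literal v2 patch contents
    CREATE OR REPLACE PROCEDURE terminate_backend_and_wait(appname text) AS $$
    DECLARE
        pid_v int;
        is_terminated bool := false;
    BEGIN
        -- find the remote backend by its application_name
        SELECT pid INTO pid_v FROM pg_stat_activity
            WHERE application_name = appname LIMIT 1;
        -- signal it, then poll in 1s slices until it is gone
        SELECT * INTO is_terminated FROM pg_terminate_backend(pid_v);
        LOOP
            EXIT WHEN is_terminated;
            SELECT * INTO is_terminated
                FROM pg_wait_for_backend_termination(pid_v, 1000);
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;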
On Thu, Apr 08, 2021 at 06:27:56PM +0530, Bharath Rupireddy wrote:
> Agree. Please see the attached patch; I removed the fixed waiting
> time. Instead of relying on pg_stat_activity, pg_sleep and
> pg_stat_clear_snapshot, it now depends on pg_terminate_backend and
> pg_wait_for_backend_termination. This way we reduce the functions that
> the procedure terminate_backend_and_wait uses, and the new functions
> pg_terminate_backend and pg_wait_for_backend_termination get test
> coverage.

+ EXIT WHEN is_terminated;
+ SELECT * INTO is_terminated FROM pg_wait_for_backend_termination(pid_v, 1000);

This is still a regression if the termination takes more than 1s, no? In such a case terminate_backend_and_wait() would issue a WARNING and pollute the regression test output. I can see the point of what you are achieving here, and that's a good idea, but from the point of view of the buildfarm the WARNING introduced by aaf0432 is a no-go. I honestly don't quite get the benefit of issuing a WARNING in this case anyway, as the code already returns false on timeout, so the caller would know the status of the operation.
--
Michael
On Fri, Apr 9, 2021 at 5:51 AM Michael Paquier <michael@paquier.xyz> wrote:
> This is still a regression if the termination takes more than 1s, no?
> In such a case terminate_backend_and_wait() would issue a WARNING and
> pollute the regression test output. I can see the point of what you
> are achieving here, and that's a good idea, but from the point of view
> of the buildfarm the WARNING introduced by aaf0432 is a no-go.

I didn't think of the warning cases, my bad. How about using SET client_min_messages = 'ERROR'; before we call pg_wait_for_backend_termination? We can then depend only on the return value of pg_wait_for_backend_termination, and exit when it is true. This way the buildfarm will not see warnings. Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
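A sketch of this suggestion, assuming the helper keeps the same loop shape as before; the DO-block form is an assumption. client_min_messages hides the per-attempt timeout WARNING from the regression output, while the loop keys off the boolean result alone.

    -- illustrative only
    SET client_min_messages = 'ERROR';
    DO $$
    DECLARE
        pid_v int;
        is_terminated bool := false;
    BEGIN
        SELECT pid INTO pid_v FROM pg_stat_activity
            WHERE application_name = 'fdw_retry_check';
        PERFORM pg_terminate_backend(pid_v);
        -- any WARNING emitted on each 1s timeout is suppressed above
        WHILE NOT is_terminated LOOP
            SELECT * INTO is_terminated
                FROM pg_wait_for_backend_termination(pid_v, 1000);
        END LOOP;
    END;
    $$;
    RESET client_min_messages;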
On Fri, Apr 09, 2021 at 06:53:21AM +0530, Bharath Rupireddy wrote:
> I didn't think of the warning cases, my bad. How about using SET
> client_min_messages = 'ERROR'; before we call
> pg_wait_for_backend_termination? We can then depend only on the return
> value of pg_wait_for_backend_termination, and exit when it is true.
> This way the buildfarm will not see warnings. Thoughts?

You could do that, but I would also bet that this is going to get forgotten in the future if this gets extended into more SQL tests that are output-sensitive, in or out of core. Honestly, I can get behind a warning in pg_wait_for_backend_termination() to inform that the process poked at is not a PostgreSQL one, because it offers new and useful information to the user. But, and my apologies for sounding a bit noisy, I really don't get why pg_wait_until_termination() has any need to do that. From what I can see, it provides the following information:
- A PID, that we already know from the caller or just from pg_stat_activity.
- A timeout, already known as well.
- The fact that the process did not terminate, information given by the "false" status, only used in this case.

So there is no new information here for the user, only a duplicate of what's already known to the caller of this function. I see more advantage in removing this WARNING than in keeping it.
--
Michael
At Fri, 9 Apr 2021 10:59:44 +0900, Michael Paquier <michael@paquier.xyz> wrote in
> So there is no new information here for the user, only a duplicate of
> what's already known to the caller of this function. I see more
> advantage in removing this WARNING than in keeping it.

FWIW I agree with Michael. I faintly remember thinking the same while reviewing, but it seems that I forgot to write a comment to that effect. That check is the caller's job: concretely, the existing callers and any possible script that calls the function.

regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, Apr 9, 2021 at 7:29 AM Michael Paquier <michael@paquier.xyz> wrote:
> So there is no new information here for the user, only a duplicate of
> what's already known to the caller of this function. I see more
> advantage in removing this WARNING than in keeping it.

IMO it does make sense to provide a warning for a bool-returning function if there are multiple situations in which the function returns false. This gives clear information as to why false was returned.

pg_terminate_backend: false is returned 1) when the process with the given pid is not a backend (warning "PID %d is not a PostgreSQL server process"); 2) if the kill() fails (warning "could not send signal to process %d: %m"); 3) if a timeout is specified and the backend is not terminated within it (warning "backend with PID %d did not terminate within %lld milliseconds").

pg_cancel_backend: false is returned 1) when the process with the given pid is not a backend; 2) if the kill() fails.

pg_wait_for_backend_termination: false is returned 1) when the process with the given pid is not a backend; 2) when the backend is not terminated within the timeout.

If we ensure that all the above functions return false in only one situation and error out in all other situations, then removing the warnings makes sense.

Having said that, there seems to be a reason for issuing a warning and returning false instead of an error: callers can then call these functions in a loop until they return true. See the below comments:

/*
 * This is just a warning so a loop-through-resultset will not abort
 * if one backend terminated on its own during the run.
 */

/* Again, just a warning to allow loops */

I would like to keep the behaviour of these functions as-is.

> You could do that, but I would also bet that this is going to get
> forgotten in the future if this gets extended into more SQL tests that
> are output-sensitive, in or out of core.

On the point of hackers (wanting to use these functions in more SQL tests) forgetting that pg_terminate_backend, pg_cancel_backend and pg_wait_for_backend_termination issue a warning in some cases, which might pollute the tests if used without suppressing those warnings, I feel that is best left to the patch implementers and the reviewers.

For our part, we documented that pg_terminate_backend and pg_wait_for_backend_termination emit a warning on timeout: "On timeout a warning is emitted and <literal>false</literal> is returned".

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
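A minimal illustration of the loop-through-resultset point quoted above; the application_name pattern is hypothetical. If one of the matched backends exits on its own while the query scans the result set, pg_terminate_backend only emits a WARNING for that row instead of aborting, so the remaining rows are still processed.

    -- terminate every matching backend; a backend that is already gone
    -- produces a WARNING for its row, not an error for the whole query
    SELECT pg_terminate_backend(pid)
      FROM pg_stat_activity
     WHERE application_name LIKE 'fdw_retry%';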
On Thu, Apr 8, 2021 at 6:27 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
> Agree. Please see the attached patch; I removed the fixed waiting
> time. Instead of relying on pg_stat_activity, pg_sleep and
> pg_stat_clear_snapshot, it now depends on pg_terminate_backend and
> pg_wait_for_backend_termination. This way we reduce the functions that
> the procedure terminate_backend_and_wait uses, and the new functions
> pg_terminate_backend and pg_wait_for_backend_termination get test
> coverage.

I realized that the usage below in the v2 patch is completely wrong, because pg_terminate_backend without a timeout returns true as soon as the signal is sent, so the loop exits without ever calling pg_wait_for_backend_termination. Sorry for not realizing this earlier.

SELECT * INTO is_terminated FROM pg_terminate_backend(pid_v);
LOOP
    EXIT WHEN is_terminated;
    SELECT * INTO is_terminated FROM pg_wait_for_backend_termination(pid_v, 1000);
END LOOP;

I feel that we can provide a high timeout value (it can be 1hr, along the lines of using pg_sleep(3600) for the crash tests in 013_crash_restart.pl, with the assumption that the backend running that command will get killed with SIGQUIT) and make pg_terminate_backend wait. This unreasonably high timeout looks okay because of the assumption that the servers in the buildfarm will not take that much time to remove the backend from the system processes, so the function will return much earlier than that. If there is a server (which is impractical IMO) that doesn't remove the backend process even within 1hr, it will fail with the warning anyway.

With the attached patch, we can remove the procedure terminate_backend_and_wait altogether. Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
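Presumably the attached v3 then reduces each termination to a single statement along these lines; the 3600000 ms value reflects the 1-hour figure described above and is an assumption about the patch contents, not a quote from it.

    -- wait up to 1 hour; in practice the backend exits almost immediately
    SELECT pg_terminate_backend(pid, 3600000) FROM pg_stat_activity
      WHERE application_name = 'fdw_retry_check';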
On Fri, Apr 09, 2021 at 04:53:01PM +0530, Bharath Rupireddy wrote:
> I feel that we can provide a high timeout value (it can be 1hr, along
> the lines of using pg_sleep(3600) for the crash tests in
> 013_crash_restart.pl, with the assumption that the backend running
> that command will get killed with SIGQUIT) and make
> pg_terminate_backend wait.

You may not need a value as large as 1h for that :)

Looking at the TAP tests, some areas have been living with timeouts of up to 180s. It is a matter of balance here: a timeout too long would be annoying, as it would delay the detection of a problem on machines that are stuck, and a too-short value generates false positives. 5 minutes gives some balance, but there is really no perfect value.

> With the attached patch, we can remove the procedure
> terminate_backend_and_wait altogether. Thoughts?

That's clearly better, and logically it would work. As those tests are new in 14, it may be a good idea to clean all that up so that all the branches have the same set of tests. Would people object to that?
--
Michael
On Mon, Apr 12, 2021 at 11:18 AM Michael Paquier <michael@paquier.xyz> wrote:
> You may not need a value as large as 1h for that :)
>
> Looking at the TAP tests, some areas have been living with timeouts of
> up to 180s. It is a matter of balance here: a timeout too long would
> be annoying, as it would delay the detection of a problem on machines
> that are stuck, and a too-short value generates false positives. 5
> minutes gives some balance, but there is really no perfect value.

I changed it to 5min. If there is any server that takes more than 5min to remove a process from the system process list, it will see a warning on timeout.

> That's clearly better, and logically it would work. As those tests
> are new in 14, it may be a good idea to clean all that up so that all
> the branches have the same set of tests. Would people object to that?

Yes, these tests were introduced in v14, so +1 to clean them up with this patch on v14 as well as on master. Attaching v4, please review further.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Apr 12, 2021 at 11:29:28AM +0530, Bharath Rupireddy wrote:
> I changed it to 5min. If there is any server that takes more than 5min
> to remove a process from the system process list, it will see a
> warning on timeout.

Looks fine to me. Let's wait a bit first to see if Fujii-san has any objections to this cleanup, as that's his code originally, from 32a9c0bd.
--
Michael
On Tue, Apr 13, 2021 at 04:39:58PM +0900, Michael Paquier wrote:
> Looks fine to me. Let's wait a bit first to see if Fujii-san has any
> objections to this cleanup, as that's his code originally, from
> 32a9c0bd.

And hearing nothing, done. The postgres_fdw tests are getting much faster for me now, from basically 6s to 4s (actually that's roughly 1.8s of gain, as pg_wait_until_termination waits at least 100ms, twice), so that's a nice gain.
--
Michael
Michael Paquier <michael@paquier.xyz> writes:
> And hearing nothing, done. The postgres_fdw tests are getting much
> faster for me now, from basically 6s to 4s (actually that's roughly
> 1.8s of gain, as pg_wait_until_termination waits at least 100ms,
> twice), so that's a nice gain.

The buildfarm is showing that one of these test queries is not stable under CLOBBER_CACHE_ALWAYS:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2021-05-01%2007%3A44%3A47

of which the relevant part is:

diff -U3 /home/buildfarm/buildroot/HEAD/pgsql.build/contrib/postgres_fdw/expected/postgres_fdw.out /home/buildfarm/buildroot/HEAD/pgsql.build/contrib/postgres_fdw/results/postgres_fdw.out
--- /home/buildfarm/buildroot/HEAD/pgsql.build/contrib/postgres_fdw/expected/postgres_fdw.out	2021-05-01 03:44:45.022300613 -0400
+++ /home/buildfarm/buildroot/HEAD/pgsql.build/contrib/postgres_fdw/results/postgres_fdw.out	2021-05-03 09:11:24.051379288 -0400
@@ -9215,8 +9215,7 @@
   WHERE application_name = 'fdw_retry_check';
  pg_terminate_backend
 ----------------------
- t
-(1 row)
+(0 rows)
 
 -- This query should detect the broken connection when starting new remote
 -- transaction, reestablish new connection, and then succeed.

I can reproduce that locally by setting

alter system set debug_invalidate_system_caches_always = 1;

and running "make installcheck" in contrib/postgres_fdw. (It takes a good long time to run the whole test script though, so you might want to see if running just these few queries is enough.)

There's no evidence of distress in the postmaster log, so I suspect this might just be a timing instability, e.g. remote process already gone before local process looks. If so, it's probably hopeless to make this test stable as-is. Perhaps we should just take it out.

regards, tom lane
On Tue, May 4, 2021 at 4:12 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The buildfarm is showing that one of these test queries is not stable
> under CLOBBER_CACHE_ALWAYS:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2021-05-01%2007%3A44%3A47

Thanks for the report.

> I can reproduce that locally by setting
>
> alter system set debug_invalidate_system_caches_always = 1;
>
> and running "make installcheck" in contrib/postgres_fdw. (It takes a
> good long time to run the whole test script though, so you might want
> to see if running just these few queries is enough.)

I can reproduce the issue with the failing case. The issue is that the backend pid will be NULL in pg_stat_activity because of the cache invalidation that happens at the beginning of the query, and hence pg_terminate_backend returns NULL on NULL input.

> There's no evidence of distress in the postmaster log, so I suspect
> this might just be a timing instability, e.g. remote process already
> gone before local process looks. If so, it's probably hopeless to
> make this test stable as-is. Perhaps we should just take it out.

Actually, that test case covers the retry code, so removing it worries me. Instead, I can do as attached, i.e. ignore the pg_terminate_backend output using PERFORM, since the function signals the backend if the given pid is a valid backend pid and returns on success. If the function were to return false, it would emit a warning, so that would be caught in the tests. And having a retry test case with cache clobbering enabled doesn't make sense, because all the cache entries are removed/invalidated for each query; but the test case covers the code on non-clobber platforms, so I would like to keep it.

Please see the attached; it passes with "alter system set debug_invalidate_system_caches_always = 1;" on my system. Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
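A sketch of the PERFORM idea; the attached patch reportedly wraps this in a procedure invoked via CALL (see the next message), so the bare DO block below is only an approximation. PERFORM discards pg_terminate_backend's result, so a missing pg_stat_activity row under cache clobbering cannot change the test output, while any WARNING on failure would still show up.

    -- illustrative shape only, not the attached patch itself
    DO $$
    BEGIN
        -- result is discarded; zero matching rows is silently a no-op
        PERFORM pg_terminate_backend(pid, 180000) FROM pg_stat_activity
            WHERE application_name = 'fdw_retry_check';
    END;
    $$;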
On Tue, May 04, 2021 at 12:43:53PM +0530, Bharath Rupireddy wrote:
> And having a retry test case with cache clobbering enabled doesn't
> make sense, because all the cache entries are removed/invalidated for
> each query; but the test case covers the code on non-clobber
> platforms, so I would like to keep it.

Yeah, I'd rather keep this test around, as it is specific to the connection cache, and it is not time-consuming on fast machines in its new shape either. Another trick we could use here would be an aggregate checking the number of rows returned, say:

SELECT count(pg_terminate_backend(pid, 180000)) >= 0 FROM pg_stat_activity
  WHERE application_name = 'fdw_retry_check';

But using CALL as you are suggesting is much cleaner.

(Worth noting that I am out this week for Golden Week, so if this can wait until Monday, that would be nice. I am not willing to take my chances with the buildfarm now :p)
--
Michael
Michael Paquier <michael@paquier.xyz> writes:
> (Worth noting that I am out this week for Golden Week, so if this can
> wait until Monday, that would be nice. I am not willing to take my
> chances with the buildfarm now :p)

I will see to it. I think it's important to get a fix in in the next couple of days, because hyrax has not had a clean run in six weeks. That animal takes almost a week per test cycle, so the next HEAD run it starts (two or three days from now) is about our last chance to get it to go green before beta1 wrap. I feel it's fairly urgent to try to do that, because who knows if any other cache-clobber issues snuck in just before feature freeze.

regards, tom lane
On Tue, May 4, 2021 at 7:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I will see to it. I think it's important to get a fix in in the next
> couple of days, because hyrax has not had a clean run in six weeks.
> That animal takes almost a week per test cycle, so the next HEAD run
> it starts (two or three days from now) is about our last chance to
> get it to go green before beta1 wrap. I feel it's fairly urgent to
> try to do that, because who knows if any other cache-clobber issues
> snuck in just before feature freeze.

Thanks! Can we then take the patch proposed at [1]?

[1] - https://www.postgresql.org/message-id/CALj2ACWqh2nHzPyzP-bAY%2BCaAAbtQRO55AQ_4ppGiU_w8iOvTg%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes:
> On Tue, May 4, 2021 at 4:12 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The buildfarm is showing that one of these test queries is not stable
>> under CLOBBER_CACHE_ALWAYS:

> I can reproduce the issue with the failing case. The issue is that the
> backend pid will be NULL in pg_stat_activity because of the cache
> invalidation that happens at the beginning of the query, and hence
> pg_terminate_backend returns NULL on NULL input.

No, that's nonsense: if it were happening that way, the query would return one row with a NULL result, but actually it's returning no rows. What's actually happening, it seems, is that because pgfdw_inval_callback is constantly getting called due to cache flushes, we invariably drop remote connections immediately during transaction termination (cf pgfdw_xact_callback). Thus, by the time we inspect pg_stat_activity, there is no remote session to terminate.

I don't like your patch because what it effectively does is mask whether termination happened or not; if there were a bug there causing that not to happen, the test would still appear to pass.

I think the most expedient fix, if we want to keep this test, is just to transiently disable debug_invalidate_system_caches_always. (That option wasn't available before v14, but fortunately we don't need a fix for the back branches.)

I believe the attached will do the trick, but I'm running the test with debug_invalidate_system_caches_always turned on to verify that. Should be done in an hour or so...

regards, tom lane

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 8e1cc69508..6f533c745d 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -9204,6 +9204,12 @@ WARNING: there is no transaction in progress
 -- Change application_name of remote connection to special one
 -- so that we can easily terminate the connection later.
 ALTER SERVER loopback OPTIONS (application_name 'fdw_retry_check');
+-- If debug_invalidate_system_caches_always is active, it results in
+-- dropping remote connections after every transaction, making it
+-- impossible to test termination meaningfully. So turn that off
+-- for this test.
+SET debug_invalidate_system_caches_always = 0;
+-- Make sure we have a remote connection.
 SELECT 1 FROM ft1 LIMIT 1;
  ?column?
 ----------
@@ -9227,9 +9233,8 @@ SELECT 1 FROM ft1 LIMIT 1;
          1
 (1 row)
 
--- If the query detects the broken connection when starting new remote
--- subtransaction, it doesn't reestablish new connection and should fail.
--- The text of the error might vary across platforms, so don't show it.
+-- If we detect the broken connection when starting a new remote
+-- subtransaction, we should fail instead of establishing a new connection.
 -- Terminate the remote connection and wait for the termination to complete.
 SELECT pg_terminate_backend(pid, 180000) FROM pg_stat_activity
 	WHERE application_name = 'fdw_retry_check';
@@ -9239,11 +9244,13 @@ SELECT pg_terminate_backend(pid, 180000) FROM pg_stat_activity
 (1 row)
 
 SAVEPOINT s;
+-- The text of the error might vary across platforms, so only show SQLSTATE.
 \set VERBOSITY sqlstate
 SELECT 1 FROM ft1 LIMIT 1;  -- should fail
 ERROR: 08006
 \set VERBOSITY default
 COMMIT;
+RESET debug_invalidate_system_caches_always;
 -- =============================================================================
 -- test connection invalidation cases and postgres_fdw_get_connections function
 -- =============================================================================
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index dcd36a9753..000e2534fc 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -2795,6 +2795,14 @@ ROLLBACK;
 -- Change application_name of remote connection to special one
 -- so that we can easily terminate the connection later.
 ALTER SERVER loopback OPTIONS (application_name 'fdw_retry_check');
+
+-- If debug_invalidate_system_caches_always is active, it results in
+-- dropping remote connections after every transaction, making it
+-- impossible to test termination meaningfully. So turn that off
+-- for this test.
+SET debug_invalidate_system_caches_always = 0;
+
+-- Make sure we have a remote connection.
 SELECT 1 FROM ft1 LIMIT 1;
 
 -- Terminate the remote connection and wait for the termination to complete.
@@ -2806,18 +2814,20 @@ SELECT pg_terminate_backend(pid, 180000) FROM pg_stat_activity
 BEGIN;
 SELECT 1 FROM ft1 LIMIT 1;
 
--- If the query detects the broken connection when starting new remote
--- subtransaction, it doesn't reestablish new connection and should fail.
--- The text of the error might vary across platforms, so don't show it.
+-- If we detect the broken connection when starting a new remote
+-- subtransaction, we should fail instead of establishing a new connection.
 -- Terminate the remote connection and wait for the termination to complete.
 SELECT pg_terminate_backend(pid, 180000) FROM pg_stat_activity
 	WHERE application_name = 'fdw_retry_check';
 
 SAVEPOINT s;
+-- The text of the error might vary across platforms, so only show SQLSTATE.
 \set VERBOSITY sqlstate
 SELECT 1 FROM ft1 LIMIT 1;  -- should fail
 \set VERBOSITY default
 COMMIT;
+RESET debug_invalidate_system_caches_always;
+
 -- =============================================================================
 -- test connection invalidation cases and postgres_fdw_get_connections function
 -- =============================================================================
On Tue, May 4, 2021 at 9:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think the most expedient fix, if we want to keep this test, is just
> to transiently disable debug_invalidate_system_caches_always. (That
> option wasn't available before v14, but fortunately we don't need a
> fix for the back branches.)

Thanks for pushing this change.

If debug_invalidate_system_caches_always is allowed to be used for cache-sensitive test cases, I see an opportunity to make the tests that were adjusted by commit f77717b29 more meaningful, as they were before that commit. That commit changed the way the below functions show their output in the tests:

SELECT 1 FROM postgres_fdw_disconnect_all();
SELECT server_name FROM postgres_fdw_get_connections() ORDER BY 1;

If okay, I can work on it (not for PG14, of course). It can be discussed in a separate thread though. Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
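A sketch of that opportunity, assuming the same transient-GUC pattern the committed fix uses; with cache clobbering disabled around the block, the connection-cache functions could again assert concrete output instead of existence checks. The surrounding statements are illustrative, not a proposed patch.

    -- keep the remote connection alive across transactions for this block
    SET debug_invalidate_system_caches_always = 0;
    SELECT 1 FROM ft1 LIMIT 1;
    SELECT server_name FROM postgres_fdw_get_connections() ORDER BY 1;
    RESET debug_invalidate_system_caches_always;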
On Tue, May 04, 2021 at 11:38:09AM -0400, Tom Lane wrote:
> I believe the attached will do the trick, but I'm running the test
> with debug_invalidate_system_caches_always turned on to verify that.
> Should be done in an hour or so...

Thanks for taking care of that!
--
Michael