Thread: pgsql: postgres_fdw: reestablish new connection if cached one is detect

pgsql: postgres_fdw: reestablish new connection if cached one is detect

From
Fujii Masao
Date:
postgres_fdw: reestablish new connection if cached one is detected as broken.

In postgres_fdw, once remote connections are established, they are cached
and re-used for subsequent queries and transactions. There can be some
cases where those cached connections are unavaiable, for example,
by the restart of remote server. In these cases, previously an error was
reported and the query accessing to remote server failed if new remote
transaction failed to start because the cached connection was broken.

This commit improves postgres_fdw so that new connection is remade
if broken connection is detected when starting new remote transaction.
This is useful to avoid unnecessary failure of queries when connection is
broken but can be reestablished.

Author: Bharath Rupireddy, tweaked a bit by Fujii Masao
Reviewed-by: Ashutosh Bapat, Tatsuhito Kasahara, Fujii Masao
Discussion: https://postgr.es/m/CALj2ACUAi23vf1WiHNar_LksM9EDOWXcbHCo-fD4Mbr1d=78YQ@mail.gmail.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/32a9c0bdf493cf5fc029ab44a22384d547290eff

Modified Files
--------------
contrib/postgres_fdw/connection.c              | 54 ++++++++++++++++++++------
contrib/postgres_fdw/expected/postgres_fdw.out | 48 +++++++++++++++++++++++
contrib/postgres_fdw/sql/postgres_fdw.sql      | 41 +++++++++++++++++++
3 files changed, 131 insertions(+), 12 deletions(-)


Re: pgsql: postgres_fdw: reestablish new connection if cached one is detect

From
Michael Paquier
Date:
Hi Fujii-san,

On Tue, Oct 06, 2020 at 01:52:55AM +0000, Fujii Masao wrote:
> postgres_fdw: reestablish new connection if cached one is detected as broken.
>
> In postgres_fdw, once remote connections are established, they are cached
> and re-used for subsequent queries and transactions. There can be some
> cases where those cached connections are unavaiable, for example,
> by the restart of remote server. In these cases, previously an error was
> reported and the query accessing to remote server failed if new remote
> transaction failed to start because the cached connection was broken.
>
> This commit improves postgres_fdw so that new connection is remade
> if broken connection is detected when starting new remote transaction.
> This is useful to avoid unnecessary failure of queries when connection is
> broken but can be reestablished.

lorikeet is telling that the test introduced by this commit is
unstable:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2020-10-06%2008%3A28%3A36

Some details:
 BEGIN;
 SELECT 1 FROM ft1 LIMIT 1;
- ?column?
-----------
-        1
-(1 row)
-
+ERROR:  could not receive data from server: Software caused connection abort
+CONTEXT:  remote SQL command: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
--
Michael

Attachment

Re: pgsql: postgres_fdw: reestablish new connection if cached one is detect

From
Fujii Masao
Date:

On 2020/10/07 11:13, Michael Paquier wrote:
> Hi Fujii-san,
> 
> On Tue, Oct 06, 2020 at 01:52:55AM +0000, Fujii Masao wrote:
>> postgres_fdw: reestablish new connection if cached one is detected as broken.
>>
>> In postgres_fdw, once remote connections are established, they are cached
>> and re-used for subsequent queries and transactions. There can be some
>> cases where those cached connections are unavaiable, for example,
>> by the restart of remote server. In these cases, previously an error was
>> reported and the query accessing to remote server failed if new remote
>> transaction failed to start because the cached connection was broken.
>>
>> This commit improves postgres_fdw so that new connection is remade
>> if broken connection is detected when starting new remote transaction.
>> This is useful to avoid unnecessary failure of queries when connection is
>> broken but can be reestablished.
> 
> lorikeet is telling that the test introduced by this commit is
> unstable:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2020-10-06%2008%3A28%3A36

Thanks for letting me know this!

> 
> Some details:
>   BEGIN;
>   SELECT 1 FROM ft1 LIMIT 1;
> - ?column?
> -----------
> -        1
> -(1 row)
> -
> +ERROR:  could not receive data from server: Software caused connection abort
> +CONTEXT:  remote SQL command: START TRANSACTION ISOLATION LEVEL REPEATABLE READ

This error means that new connection was successfully reestablished
after the cached connection was terminated, and then the above connection
error occurred when issuing "START TRANSACTION" command on that
new connection. There seems no suspicious relevant log messages in the
logfile. So I'm not sure why this error happened, yet.

Per the previous discusson at [1], lorikeet sometimes seems to cause
connection-relation failure in the regression test. So the cause of error
that we faced today also may be lorikeet itself.

[1]
https://www.postgresql.org/message-id/CA+hUKGL3Son9iAeqgjPbXCpU_6hhZhw9X24uNO14mOC4bG0cCA@mail.gmail.com

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: pgsql: postgres_fdw: reestablish new connection if cached one is detect

From
Fujii Masao
Date:

On 2020/10/07 12:54, Fujii Masao wrote:
> 
> 
> On 2020/10/07 11:13, Michael Paquier wrote:
>> Hi Fujii-san,
>>
>> On Tue, Oct 06, 2020 at 01:52:55AM +0000, Fujii Masao wrote:
>>> postgres_fdw: reestablish new connection if cached one is detected as broken.
>>>
>>> In postgres_fdw, once remote connections are established, they are cached
>>> and re-used for subsequent queries and transactions. There can be some
>>> cases where those cached connections are unavaiable, for example,
>>> by the restart of remote server. In these cases, previously an error was
>>> reported and the query accessing to remote server failed if new remote
>>> transaction failed to start because the cached connection was broken.
>>>
>>> This commit improves postgres_fdw so that new connection is remade
>>> if broken connection is detected when starting new remote transaction.
>>> This is useful to avoid unnecessary failure of queries when connection is
>>> broken but can be reestablished.
>>
>> lorikeet is telling that the test introduced by this commit is
>> unstable:
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2020-10-06%2008%3A28%3A36
> 
> Thanks for letting me know this!
> 
>>
>> Some details:
>>   BEGIN;
>>   SELECT 1 FROM ft1 LIMIT 1;
>> - ?column?
>> -----------
>> -        1
>> -(1 row)
>> -
>> +ERROR:  could not receive data from server: Software caused connection abort
>> +CONTEXT:  remote SQL command: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
> 
> This error means that new connection was successfully reestablished
> after the cached connection was terminated, and then the above connection
> error occurred when issuing "START TRANSACTION" command on that
> new connection. There seems no suspicious relevant log messages in the
> logfile. So I'm not sure why this error happened, yet.
> 
> Per the previous discusson at [1], lorikeet sometimes seems to cause
> connection-relation failure in the regression test. So the cause of error
> that we faced today also may be lorikeet itself.

Since it's not good to keep the buildfarm member red, I will revert
the commit unless I come up with something even after further
investigation.

My current just guess is that PQstatus(conn) doesn't indicate
CONNECTION_BAD when the above error occurs, and which
prevents new connection from being reestablished because of
the following check.

+        if (PQstatus(entry->conn) != CONNECTION_BAD ||
+            entry->xact_depth > 0 ||
+            retry_conn)
+            PG_RE_THROW();

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: pgsql: postgres_fdw: reestablish new connection if cached one is detect

From
Fujii Masao
Date:

On 2020/10/07 22:25, Fujii Masao wrote:
> 
> 
> On 2020/10/07 12:54, Fujii Masao wrote:
>>
>>
>> On 2020/10/07 11:13, Michael Paquier wrote:
>>> Hi Fujii-san,
>>>
>>> On Tue, Oct 06, 2020 at 01:52:55AM +0000, Fujii Masao wrote:
>>>> postgres_fdw: reestablish new connection if cached one is detected as broken.
>>>>
>>>> In postgres_fdw, once remote connections are established, they are cached
>>>> and re-used for subsequent queries and transactions. There can be some
>>>> cases where those cached connections are unavaiable, for example,
>>>> by the restart of remote server. In these cases, previously an error was
>>>> reported and the query accessing to remote server failed if new remote
>>>> transaction failed to start because the cached connection was broken.
>>>>
>>>> This commit improves postgres_fdw so that new connection is remade
>>>> if broken connection is detected when starting new remote transaction.
>>>> This is useful to avoid unnecessary failure of queries when connection is
>>>> broken but can be reestablished.
>>>
>>> lorikeet is telling that the test introduced by this commit is
>>> unstable:
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2020-10-06%2008%3A28%3A36
>>
>> Thanks for letting me know this!
>>
>>>
>>> Some details:
>>>   BEGIN;
>>>   SELECT 1 FROM ft1 LIMIT 1;
>>> - ?column?
>>> -----------
>>> -        1
>>> -(1 row)
>>> -
>>> +ERROR:  could not receive data from server: Software caused connection abort
>>> +CONTEXT:  remote SQL command: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
>>
>> This error means that new connection was successfully reestablished
>> after the cached connection was terminated, and then the above connection
>> error occurred when issuing "START TRANSACTION" command on that
>> new connection. There seems no suspicious relevant log messages in the
>> logfile. So I'm not sure why this error happened, yet.
>>
>> Per the previous discusson at [1], lorikeet sometimes seems to cause
>> connection-relation failure in the regression test. So the cause of error
>> that we faced today also may be lorikeet itself.
> 
> Since it's not good to keep the buildfarm member red, I will revert
> the commit unless I come up with something even after further
> investigation.
> 
> My current just guess is that PQstatus(conn) doesn't indicate
> CONNECTION_BAD when the above error occurs, and which
> prevents new connection from being reestablished because of
> the following check.
> 
> +        if (PQstatus(entry->conn) != CONNECTION_BAD ||
> +            entry->xact_depth > 0 ||
> +            retry_conn)
> +            PG_RE_THROW();

The error message in discussion is reported when recv() fails and
errno=ECONNABORTED. As far as I read the code, pqReadData() marks
the connection as CONNECTION_BAD when errno=ECONNRESET,
but not when errno=ECONNABORTED. So since PQstatus(entry->conn)
doesn't indicate CONNECTION_BAD in ECONNABORTED case,
the above check is passed through, an error is re-thrown and
new connection is not reestablished.

Therefore, the easy fix is to make libpq mark the connection as
CONNECTION_BAD even in ECONNABORTED, like we do in ECONNRESET.
But is it safe to do that? Thought?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: pgsql: postgres_fdw: reestablish new connection if cached one is detect

From
Fujii Masao
Date:

On 2020/10/08 0:48, Fujii Masao wrote:
> 
> 
> On 2020/10/07 22:25, Fujii Masao wrote:
>>
>>
>> On 2020/10/07 12:54, Fujii Masao wrote:
>>>
>>>
>>> On 2020/10/07 11:13, Michael Paquier wrote:
>>>> Hi Fujii-san,
>>>>
>>>> On Tue, Oct 06, 2020 at 01:52:55AM +0000, Fujii Masao wrote:
>>>>> postgres_fdw: reestablish new connection if cached one is detected as broken.
>>>>>
>>>>> In postgres_fdw, once remote connections are established, they are cached
>>>>> and re-used for subsequent queries and transactions. There can be some
>>>>> cases where those cached connections are unavaiable, for example,
>>>>> by the restart of remote server. In these cases, previously an error was
>>>>> reported and the query accessing to remote server failed if new remote
>>>>> transaction failed to start because the cached connection was broken.
>>>>>
>>>>> This commit improves postgres_fdw so that new connection is remade
>>>>> if broken connection is detected when starting new remote transaction.
>>>>> This is useful to avoid unnecessary failure of queries when connection is
>>>>> broken but can be reestablished.
>>>>
>>>> lorikeet is telling that the test introduced by this commit is
>>>> unstable:
>>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2020-10-06%2008%3A28%3A36
>>>
>>> Thanks for letting me know this!
>>>
>>>>
>>>> Some details:
>>>>   BEGIN;
>>>>   SELECT 1 FROM ft1 LIMIT 1;
>>>> - ?column?
>>>> -----------
>>>> -        1
>>>> -(1 row)
>>>> -
>>>> +ERROR:  could not receive data from server: Software caused connection abort
>>>> +CONTEXT:  remote SQL command: START TRANSACTION ISOLATION LEVEL REPEATABLE READ
>>>
>>> This error means that new connection was successfully reestablished
>>> after the cached connection was terminated, and then the above connection
>>> error occurred when issuing "START TRANSACTION" command on that
>>> new connection. There seems no suspicious relevant log messages in the
>>> logfile. So I'm not sure why this error happened, yet.
>>>
>>> Per the previous discusson at [1], lorikeet sometimes seems to cause
>>> connection-relation failure in the regression test. So the cause of error
>>> that we faced today also may be lorikeet itself.
>>
>> Since it's not good to keep the buildfarm member red, I will revert
>> the commit unless I come up with something even after further
>> investigation.
>>
>> My current just guess is that PQstatus(conn) doesn't indicate
>> CONNECTION_BAD when the above error occurs, and which
>> prevents new connection from being reestablished because of
>> the following check.
>>
>> +        if (PQstatus(entry->conn) != CONNECTION_BAD ||
>> +            entry->xact_depth > 0 ||
>> +            retry_conn)
>> +            PG_RE_THROW();
> 
> The error message in discussion is reported when recv() fails and
> errno=ECONNABORTED. As far as I read the code, pqReadData() marks
> the connection as CONNECTION_BAD when errno=ECONNRESET,
> but not when errno=ECONNABORTED. So since PQstatus(entry->conn)
> doesn't indicate CONNECTION_BAD in ECONNABORTED case,
> the above check is passed through, an error is re-thrown and
> new connection is not reestablished.
> 
> Therefore, the easy fix is to make libpq mark the connection as
> CONNECTION_BAD even in ECONNABORTED, like we do in ECONNRESET.

Patch attached. This patch also changes errcode_for_socket_access()
so that it uses ERRCODE_CONNECTION_FAILURE rather than
ERRCODE_INTERNAL_ERROR as sqlerrorcode in ECONNABORTED case
like ECONNRESET. Is this sane?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment
Fujii Masao <masao.fujii@oss.nttdata.com> writes:
>> Therefore, the easy fix is to make libpq mark the connection as
>> CONNECTION_BAD even in ECONNABORTED, like we do in ECONNRESET.

> Patch attached. This patch also changes errcode_for_socket_access()
> so that it uses ERRCODE_CONNECTION_FAILURE rather than
> ERRCODE_INTERNAL_ERROR as sqlerrorcode in ECONNABORTED case
> like ECONNRESET. Is this sane?

Some digging around on the net suggests that Windows will only return
WSAECONNRESET for the case where it gets a TCP RST packet from the remote
host.  Other cases of TCP connection loss (e.g. timeout) are reported
as WSAECONNABORTED.

I think that our code generally only expects ECONNRESET to be reported
for a lost connection.  Based on this info, we clearly need to handle
ECONNABORTED as being equivalent to that.  I don't see any reason why
we'd care to distinguish the two cases.

So it seems like (a) we'd better make sure that ECONNABORTED is
recognized anyplace that we're testing for ECONNRESET, and (b) that
seems like a back-patchable bug fix, independently of the immediate
issue.

On the other hand, it seems a bit odd that we've not recognized
this issue long since.  Maybe the preference for ECONNABORTED only
exists on a few Windows versions?

I've not checked your patch in detail.

            regards, tom lane



I wrote:
> I've not checked your patch in detail.

Bread crumb for the archives: I started a -hackers thread about this:

https://www.postgresql.org/message-id/flat/2621622.1602184554%40sss.pgh.pa.us

            regards, tom lane



Fujii Masao <masao.fujii@oss.nttdata.com> writes:
>>> Therefore, the easy fix is to make libpq mark the connection as
>>> CONNECTION_BAD even in ECONNABORTED, like we do in ECONNRESET.

So in the wake of commit fe27009cb, this is what lorikeet is doing:

@@ -9028,9 +9028,7 @@
 CALL terminate_backend_and_wait('fdw_retry_check');
 SAVEPOINT s;
 SELECT 1 FROM ft1 LIMIT 1;    -- should fail
-ERROR:  server closed the connection unexpectedly
-    This probably means the server terminated abnormally
-    before or while processing the request.
+ERROR:  could not receive data from server: Software caused connection abort
 CONTEXT:  remote SQL command: SAVEPOINT s2
 COMMIT;
 -- Clean up

which is better --- the connection recovery is working --- but
obviously still not a "pass".

The only way to make this test pass as-is is to hack libpq so that
it deems ECONNABORTED to be a server crash, which frankly I do not
think is a good change.  I don't see any principled reason to think
that it means that rather than a network connectivity failure.

What I've done instead as a stopgap is to adjust the test case to be
insensitive to the exact error message text.  Maybe there's a better
way, but I can't think of anything besides having two (or more?)
expected-output files.  That would be quite unpalatable as things
stand, though perhaps we could make it tolerable by splitting this
test off into a second test script.

Meanwhile, now that I've looked at commit 32a9c0bdf, I'm not very
happy with it:

* The control flow seems rather forced.  I think it was designed
on the assumption that reindenting the existing code is forbidden.
Why not use an actual loop, instead of a goto?  I also think that
it's far from obvious that the loop isn't an infinite loop; it
took me quite a while to glom onto the fact that the test inside the
PG_CATCH avoids that by checking retry_conn.  Maybe a comment would
be enough to improve that, but I feel the control logic could stand
a rethink.

* The CATCH case is completely failing to clean up after itself.
At the minimum, there has to be a FlushErrorState() there.
I don't think it'd be terribly hard to drive this code to an
error-stack-overflow PANIC.

* As is so often true of proposed patches in which PG_CATCH is
thought to be an adequate way to catch an error, this is just
unbelievably fragile.  It will break, and badly so, if it catches
an error that is anything other than what it is expecting ...
and it's not even particularly trying to verify that the error is
what it's expecting.  It might be okay, or at least a little bit
harder to break, if it verified that the error's SQLSTATE is
ERRCODE_CONNECTION_FAILURE.

            regards, tom lane



Re: pgsql: postgres_fdw: reestablish new connection if cached one is detect

From
Fujii Masao
Date:

On 2020/10/11 9:16, Tom Lane wrote:
> Fujii Masao <masao.fujii@oss.nttdata.com> writes:
>>>> Therefore, the easy fix is to make libpq mark the connection as
>>>> CONNECTION_BAD even in ECONNABORTED, like we do in ECONNRESET.
> 
> So in the wake of commit fe27009cb,

Many thanks for the commit fe27009cb !!


> this is what lorikeet is doing:
> 
> @@ -9028,9 +9028,7 @@
>   CALL terminate_backend_and_wait('fdw_retry_check');
>   SAVEPOINT s;
>   SELECT 1 FROM ft1 LIMIT 1;    -- should fail
> -ERROR:  server closed the connection unexpectedly
> -    This probably means the server terminated abnormally
> -    before or while processing the request.
> +ERROR:  could not receive data from server: Software caused connection abort
>   CONTEXT:  remote SQL command: SAVEPOINT s2
>   COMMIT;
>   -- Clean up
> 
> which is better --- the connection recovery is working --- but
> obviously still not a "pass".
> 
> The only way to make this test pass as-is is to hack libpq so that
> it deems ECONNABORTED to be a server crash, which frankly I do not
> think is a good change.  I don't see any principled reason to think
> that it means that rather than a network connectivity failure.
> 
> What I've done instead as a stopgap is to adjust the test case to be
> insensitive to the exact error message text.

For now I've not come up with better idea than current this fix.


> Maybe there's a better
> way, but I can't think of anything besides having two (or more?)
> expected-output files.  That would be quite unpalatable as things
> stand, though perhaps we could make it tolerable by splitting this
> test off into a second test script.
> 
> Meanwhile, now that I've looked at commit 32a9c0bdf, I'm not very
> happy with it:
> 
> * The control flow seems rather forced.  I think it was designed
> on the assumption that reindenting the existing code is forbidden.
> Why not use an actual loop, instead of a goto?  I also think that
> it's far from obvious that the loop isn't an infinite loop; it
> took me quite a while to glom onto the fact that the test inside the
> PG_CATCH avoids that by checking retry_conn.  Maybe a comment would
> be enough to improve that, but I feel the control logic could stand
> a rethink.

Isn't it simpler and easier-to-read to just reestablish new connection again
in the retry case instead of using a loop because we retry only once?
Patch attached.


> 
> * The CATCH case is completely failing to clean up after itself.
> At the minimum, there has to be a FlushErrorState() there.
> I don't think it'd be terribly hard to drive this code to an
> error-stack-overflow PANIC.

You're right. Sorry I forgot to take into consideration this :(
I fixed this issue in the attached patch.


> 
> * As is so often true of proposed patches in which PG_CATCH is
> thought to be an adequate way to catch an error, this is just
> unbelievably fragile.  It will break, and badly so, if it catches
> an error that is anything other than what it is expecting ...
> and it's not even particularly trying to verify that the error is
> what it's expecting.  It might be okay, or at least a little bit
> harder to break, if it verified that the error's SQLSTATE is
> ERRCODE_CONNECTION_FAILURE.

"PQstatus()==CONNECTION_BAD" was checked for that purpose. But that's not enough?
Anyway, in the patch, I changed the code so that sqlstate==ERRCODE_CONNECTION_FAILURE
is checked.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment
Fujii Masao <masao.fujii@oss.nttdata.com> writes:
> On 2020/10/11 9:16, Tom Lane wrote:
>> Meanwhile, now that I've looked at commit 32a9c0bdf, I'm not very
>> happy with it:
>>
>> * The control flow seems rather forced.  I think it was designed
>> on the assumption that reindenting the existing code is forbidden.

> Isn't it simpler and easier-to-read to just reestablish new connection again
> in the retry case instead of using a loop because we retry only once?
> Patch attached.

Yeah, that seems better.

>> * As is so often true of proposed patches in which PG_CATCH is
>> thought to be an adequate way to catch an error, this is just
>> unbelievably fragile.  It will break, and badly so, if it catches
>> an error that is anything other than what it is expecting ...
>> and it's not even particularly trying to verify that the error is
>> what it's expecting.  It might be okay, or at least a little bit
>> harder to break, if it verified that the error's SQLSTATE is
>> ERRCODE_CONNECTION_FAILURE.

> "PQstatus()==CONNECTION_BAD" was checked for that purpose. But that's not enough?
> Anyway, in the patch, I changed the code so that sqlstate==ERRCODE_CONNECTION_FAILURE
> is checked.

The reason I'm concerned about this is that we could potentially throw an
elog(ERROR) somewhere between return from libpq and the expected ereport
call in pgfdw_report_error.  It wouldn't be wise to ignore such a
condition (it might be out-of-memory or some such).

Personally I'd code the check with explicit tests for *both*
PQstatus()==CONNECTION_BAD and sqlstate==ERRCODE_CONNECTION_FAILURE,
rather than just asserting that the latter must imply the former.
By my reading, pgfdw_report_error will report ERRCODE_CONNECTION_FAILURE
for any libpq-originated error condition, but not all of those will
set CONNECTION_BAD.

Other than that, this seems reasonable by eyeball (I didn't do any
testing).

            regards, tom lane



Re: pgsql: postgres_fdw: reestablish new connection if cached one is detect

From
Fujii Masao
Date:

On 2020/10/14 3:34, Tom Lane wrote:
> Fujii Masao <masao.fujii@oss.nttdata.com> writes:
>> On 2020/10/11 9:16, Tom Lane wrote:
>>> Meanwhile, now that I've looked at commit 32a9c0bdf, I'm not very
>>> happy with it:
>>>
>>> * The control flow seems rather forced.  I think it was designed
>>> on the assumption that reindenting the existing code is forbidden.
> 
>> Isn't it simpler and easier-to-read to just reestablish new connection again
>> in the retry case instead of using a loop because we retry only once?
>> Patch attached.
> 
> Yeah, that seems better.
> 
>>> * As is so often true of proposed patches in which PG_CATCH is
>>> thought to be an adequate way to catch an error, this is just
>>> unbelievably fragile.  It will break, and badly so, if it catches
>>> an error that is anything other than what it is expecting ...
>>> and it's not even particularly trying to verify that the error is
>>> what it's expecting.  It might be okay, or at least a little bit
>>> harder to break, if it verified that the error's SQLSTATE is
>>> ERRCODE_CONNECTION_FAILURE.
> 
>> "PQstatus()==CONNECTION_BAD" was checked for that purpose. But that's not enough?
>> Anyway, in the patch, I changed the code so that sqlstate==ERRCODE_CONNECTION_FAILURE
>> is checked.
> 
> The reason I'm concerned about this is that we could potentially throw an
> elog(ERROR) somewhere between return from libpq and the expected ereport
> call in pgfdw_report_error.  It wouldn't be wise to ignore such a
> condition (it might be out-of-memory or some such).

Understood.


> Personally I'd code the check with explicit tests for *both*
> PQstatus()==CONNECTION_BAD and sqlstate==ERRCODE_CONNECTION_FAILURE,
> rather than just asserting that the latter must imply the former.
> By my reading, pgfdw_report_error will report ERRCODE_CONNECTION_FAILURE
> for any libpq-originated error condition, but not all of those will
> set CONNECTION_BAD.

Yes, this makes sense. I updated the patch so that both sqlstate and
PQstatus() are checked. Patch attached.


> Other than that, this seems reasonable by eyeball (I didn't do any
> testing).

Thanks! Barring any objection, I will commit it.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment

Re: pgsql: postgres_fdw: reestablish new connection if cached one is detect

From
Fujii Masao
Date:

On 2020/10/15 16:21, Fujii Masao wrote:
> 
> 
> On 2020/10/14 3:34, Tom Lane wrote:
>> Fujii Masao <masao.fujii@oss.nttdata.com> writes:
>>> On 2020/10/11 9:16, Tom Lane wrote:
>>>> Meanwhile, now that I've looked at commit 32a9c0bdf, I'm not very
>>>> happy with it:
>>>>
>>>> * The control flow seems rather forced.  I think it was designed
>>>> on the assumption that reindenting the existing code is forbidden.
>>
>>> Isn't it simpler and easier-to-read to just reestablish new connection again
>>> in the retry case instead of using a loop because we retry only once?
>>> Patch attached.
>>
>> Yeah, that seems better.
>>
>>>> * As is so often true of proposed patches in which PG_CATCH is
>>>> thought to be an adequate way to catch an error, this is just
>>>> unbelievably fragile.  It will break, and badly so, if it catches
>>>> an error that is anything other than what it is expecting ...
>>>> and it's not even particularly trying to verify that the error is
>>>> what it's expecting.  It might be okay, or at least a little bit
>>>> harder to break, if it verified that the error's SQLSTATE is
>>>> ERRCODE_CONNECTION_FAILURE.
>>
>>> "PQstatus()==CONNECTION_BAD" was checked for that purpose. But that's not enough?
>>> Anyway, in the patch, I changed the code so that sqlstate==ERRCODE_CONNECTION_FAILURE
>>> is checked.
>>
>> The reason I'm concerned about this is that we could potentially throw an
>> elog(ERROR) somewhere between return from libpq and the expected ereport
>> call in pgfdw_report_error.  It wouldn't be wise to ignore such a
>> condition (it might be out-of-memory or some such).
> 
> Understood.
> 
> 
>> Personally I'd code the check with explicit tests for *both*
>> PQstatus()==CONNECTION_BAD and sqlstate==ERRCODE_CONNECTION_FAILURE,
>> rather than just asserting that the latter must imply the former.
>> By my reading, pgfdw_report_error will report ERRCODE_CONNECTION_FAILURE
>> for any libpq-originated error condition, but not all of those will
>> set CONNECTION_BAD.
> 
> Yes, this makes sense. I updated the patch so that both sqlstate and
> PQstatus() are checked. Patch attached.
> 
> 
>> Other than that, this seems reasonable by eyeball (I didn't do any
>> testing).
> 
> Thanks! Barring any objection, I will commit it.

Pushed. Thanks!

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION