Re: pgsql: postgres_fdw: reestablish new connection if cached one is detect - Mailing list pgsql-committers

From Tom Lane
Subject Re: pgsql: postgres_fdw: reestablish new connection if cached one is detect
Date
Msg-id 2943611.1602375376@sss.pgh.pa.us
Whole thread Raw
In response to Re: pgsql: postgres_fdw: reestablish new connection if cached one is detect  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: pgsql: postgres_fdw: reestablish new connection if cached one is detect  (Fujii Masao <masao.fujii@oss.nttdata.com>)
List pgsql-committers
Fujii Masao <masao.fujii@oss.nttdata.com> writes:
>>> Therefore, the easy fix is to make libpq mark the connection as
>>> CONNECTION_BAD even in ECONNABORTED, like we do in ECONNRESET.

So in the wake of commit fe27009cb, this is what lorikeet is doing:

@@ -9028,9 +9028,7 @@
 CALL terminate_backend_and_wait('fdw_retry_check');
 SAVEPOINT s;
 SELECT 1 FROM ft1 LIMIT 1;    -- should fail
-ERROR:  server closed the connection unexpectedly
-    This probably means the server terminated abnormally
-    before or while processing the request.
+ERROR:  could not receive data from server: Software caused connection abort
 CONTEXT:  remote SQL command: SAVEPOINT s2
 COMMIT;
 -- Clean up

which is better --- the connection recovery is working --- but
obviously still not a "pass".

The only way to make this test pass as-is is to hack libpq so that
it deems ECONNABORTED to be a server crash, which frankly I do not
think is a good change.  I don't see any principled reason to think
that it means that rather than a network connectivity failure.

What I've done instead as a stopgap is to adjust the test case to be
insensitive to the exact error message text.  Maybe there's a better
way, but I can't think of anything besides having two (or more?)
expected-output files.  That would be quite unpalatable as things
stand, though perhaps we could make it tolerable by splitting this
test off into a second test script.

Meanwhile, now that I've looked at commit 32a9c0bdf, I'm not very
happy with it:

* The control flow seems rather forced.  I think it was designed
on the assumption that reindenting the existing code is forbidden.
Why not use an actual loop, instead of a goto?  I also think that
it's far from obvious that the loop isn't an infinite loop; it
took me quite a while to glom onto the fact that the test inside the
PG_CATCH avoids that by checking retry_conn.  Maybe a comment would
be enough to improve that, but I feel the control logic could stand
a rethink.

* The CATCH case is completely failing to clean up after itself.
At the minimum, there has to be a FlushErrorState() there.
I don't think it'd be terribly hard to drive this code to an
error-stack-overflow PANIC.

* As is so often true of proposed patches in which PG_CATCH is
thought to be an adequate way to catch an error, this is just
unbelievably fragile.  It will break, and badly so, if it catches
an error that is anything other than what it is expecting ...
and it's not even particularly trying to verify that the error is
what it's expecting.  It might be okay, or at least a little bit
harder to break, if it verified that the error's SQLSTATE is
ERRCODE_CONNECTION_FAILURE.

            regards, tom lane



pgsql-committers by date:

Previous
From: Tom Lane
Date:
Subject: pgsql: Band-aid new postgres_fdw test case to remove error text depende
Next
From: Michael Paquier
Date:
Subject: pgsql: Use perfect hash for NFC and NFKC Unicode Normalization quick ch