Sounds reasonable. Do you mean that we can add additional GUC like "postgres_fdw.initial_check", wait WL_SOCKET_CLOSED if the conneciton is found in the hash table, and do reconnection if it might be closed, right?
Still, the reestablish mechanism can be further simplified with WL_SOCKET_CLOSED event such as the following (where we should probably rename pgfdw_connection_check_internal):
/* * If the connection needs to be remade due to invalidation, disconnect as * soon as we're out of all transactions. */
{ elog(DEBUG3, "closing connection %p for option changes to take effect", entry->conn); disconnect_pg_server(entry); }
| +else if (remoteSocketIsClosed && && entry->xact_depth > 0)
| + error ("Remote Server is down ...")
In other words, a variation of pgfdw_connection_check_internal() could potentially go into interfaces/libpq/libpq-fe.h (backend/libpq/pqcomm.c or src/interfaces/libpq/fe-connect.c). Then, GetConnection() in postgres_fdw, it can force to reconnect as it is already done for some cases or error properly:
Based on off and on discussions, I modified my patch.
I still think that it is probably too much work/code to detect the mentioned use-case you described on [1]. Each backend constantly calling CallCheckingRemoteServersCallbacks() for this purpose doesn't sound the optimal way to approach the "check whether server down" problem. You typically try to decide whether a server is down by establishing a connection (or ping etc), not going over all the existing connections.
As far as I can think of, it should probably be a single background task checking whether the server is down. If so, sending an invalidation message to all the backends such that related backends could act on the invalidation and throw an error. This is to cover the use-case you described on [1].
Also, maybe we could have a new catalog table like pg_foreign_server_health or such, where we can keep the last time the check succeeded (and/or failed), and how many times the check succeeded (and/or failed).
This is of course how I would approach this problem. I think some other perspectives on this would be very useful to hear.