Re: [Proposal] Add foreign-server health checks infrastructure - Mailing list pgsql-hackers

From Önder Kalacı
Subject Re: [Proposal] Add foreign-server health checks infrastructure
Date
Msg-id CACawEhW19nPfbFpvfke9eidFDxAy+ic36wmY0s936T=xzxgHog@mail.gmail.com
Whole thread Raw
In response to RE: [Proposal] Add foreign-server health checks infrastructure  ("kuroda.hayato@fujitsu.com" <kuroda.hayato@fujitsu.com>)
Responses RE: [Proposal] Add foreign-server health checks infrastructure
List pgsql-hackers
Hi,

Sounds reasonable. Do you mean that we can add additional GUC like "postgres_fdw.initial_check",
wait WL_SOCKET_CLOSED if the conneciton is found in the hash table, and do reconnection if it might be closed, right?


Alright, it took me sometime to realize that postgres_fdw already has a retry mechanism if the first command fails: postgres_fdw: reestablish new connection if cached one is detected as… · postgres/postgres@32a9c0b (github.com) 

Still, the reestablish mechanism can be further simplified with WL_SOCKET_CLOSED event such as the following (where we should probably rename pgfdw_connection_check_internal):

 /*
* If the connection needs to be remade due to invalidation, disconnect as
* soon as we're out of all transactions.
*/
 | +bool remoteSocketIsClosed =  entry->conn != NULL : pgfdw_connection_check_internal(entry->conn) : false;
if (entry->conn != NULL && (entry->invalidated || remoteSocketIsClosed) && entry->xact_depth == 0)
{
elog(DEBUG3, "closing connection %p for option changes to take effect",
entry->conn);
disconnect_pg_server(entry);
}
 | +else if (remoteSocketIsClosed && && entry->xact_depth > 0)
 | + error ("Remote Server is down ...")


In other words, a variation of pgfdw_connection_check_internal() could potentially go into interfaces/libpq/libpq-fe.h (backend/libpq/pqcomm.c or src/interfaces/libpq/fe-connect.c). Then, GetConnection() in postgres_fdw, it can force to reconnect as it is already done for some cases or error properly:
 

 Based on off and on discussions, I modified my patch.


I still think that it is probably too much work/code to detect the mentioned use-case you described on [1]. Each backend constantly calling CallCheckingRemoteServersCallbacks() for this purpose doesn't sound the optimal way to approach the "check whether server down" problem. You typically try to decide whether a server is down by establishing a connection (or ping etc), not going over all the existing connections. 

As far as I can think of, it should probably be a single background task checking whether the server is down. If so, sending an invalidation message to all the backends such that related backends could act on the invalidation and throw an error. This is to cover the use-case you described on [1].

Also, maybe we could have a new catalog table like pg_foreign_server_health or such, where we can keep the last time the check succeeded (and/or failed), and how many times the check  succeeded (and/or failed).

This is of course how I would approach this problem. I think some other perspectives on this would be very useful to hear.

Thanks,
Onder KALACI



 

pgsql-hackers by date:

Previous
From: Pavel Stehule
Date:
Subject: Re: Schema variables - new implementation for Postgres 15
Next
From: Zhang Mingli
Date:
Subject: Re: Issue in GIN fast-insert: XLogBeginInsert + Read/LockBuffer ordering