RE: [Proposal] Add foreign-server health checks infrastructure - Mailing list pgsql-hackers

From kuroda.hayato@fujitsu.com
Subject RE: [Proposal] Add foreign-server health checks infrastructure
Date
Msg-id TYAPR01MB5866CFD6BAE6DDF01A27CBF1F5299@TYAPR01MB5866.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: [Proposal] Add foreign-server health checks infrastructure  (Önder Kalacı <onderkalaci@gmail.com>)
Responses Re: [Proposal] Add foreign-server health checks infrastructure  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Re: [Proposal] Add foreign-server health checks infrastructure  (Önder Kalacı <onderkalaci@gmail.com>)
List pgsql-hackers
Dear Önder,

Thanks for giving suggestions!

> Still, the reestablish mechanism can be further simplified with
> WL_SOCKET_CLOSED event such as the following (where we should probably
> rename pgfdw_connection_check_internal):

Sounds reasonable.
I think it may be included in this patch. I will try to next (or later) version.

> In other words, a variation of pgfdw_connection_check_internal()
> could potentially go into interfaces/libpq/libpq-fe.h
> (backend/libpq/pqcomm.c or src/interfaces/libpq/fe-connect.c).

Hmm, IIUC libpq related function and data structures cannot be accessed from core source,
so we cannot move to pqcomm.c.
(This is a motivation for introducing libpqwalreceiver library. It is used to avoid to link libpq directly)
And functions in libpq/fe-connect.c will be included libpq.so,
but latch related functions like WaitEventSetWait() should not be called from client application.
So it is also not appropriate.
In short, there are no good position to place the function because this requires both of libpq and core functions.


> I still think that it is probably too much work/code to detect the
> mentioned use-case you described on [1]. Each backend constantly
> calling CallCheckingRemoteServersCallbacks() for this purpose doesn't sound
> the optimal way to approach the "check whether server down" problem. You
> typically try to decide whether a server is down by establishing a
> connection (or ping etc), not going over all the existing connections.

Yes, the approach that establishes a new connection is very simple, but I think it has some holes.
For example, if the DNS server or some routing software may is stopped,
we will fail to connect to foreign servers.
In your approach, we regard the case as "failed" and try to invalidate the server
even if the existing connection can be used.
Another case is that if a server goes down and failover has occurred, we will succeed to connect to foreign servers.
We may regard the case as a "success", but we cannot COMMIT the transaction.


> As far as I can think of, it should probably be a single background task
> checking whether the server is down. If so, sending an invalidation message
> to all the backends such that related backends could act on the
> invalidation and throw an error. This is to cover the use-case you
> described on [1].

Indeed your approach covers the use case I said, but I'm not sure whether it is really good. 
In your approach, once the background worker process will manage all foreign servers.
It may be OK if there are a few servers, but if there are hundreds of servers,
the time interval during checks will be longer.

Currently, each FDW can decide whether we do health checks or not per the backend.
For example, we can skip health checks if the foreign server is not used now.
The background worker cannot control such a way.

Moreover, methods to connect to foreign servers and check health are different per FDW.
In terms of mysql_fdw [1], we must do mysql_init() and mysql_real_connect().
About file_fdw, we do not have to connect, but developers may want to calculate checksum and compare.
Therefore, we must provide callback functions anyway.

Based on the above, I do not agree that we introduce a new background worker and make it to do a health check.

> This is of course how I would approach this problem. I think some other
> perspectives on this would be very useful to hear.

Yes, we can hear other opinions :-).

[1]: https://github.com/EnterpriseDB/mysql_fdw

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

 

pgsql-hackers by date:

Previous
From: "Drouvot, Bertrand"
Date:
Subject: Re: Add regular expression testing for user name mapping in the peer authentication TAP test
Next
From: Michael Paquier
Date:
Subject: Re: Avoid memory leaks during base backups