Re: [Proposal] Add foreign-server health checks infrastructure - Mailing list pgsql-hackers

From Önder Kalacı
Subject Re: [Proposal] Add foreign-server health checks infrastructure
Msg-id CACawEhUzpqYJ8mQmSjYgX0ePtPpvb2u9Onjf6pCjUGkoZ=-xSg@mail.gmail.com
In response to RE: [Proposal] Add foreign-server health checks infrastructure  ("kuroda.hayato@fujitsu.com" <kuroda.hayato@fujitsu.com>)
Responses RE: [Proposal] Add foreign-server health checks infrastructure  ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
List pgsql-hackers
Hi,

> > As far as I can think of, it should probably be a single background task
> > checking whether the server is down. If so, sending an invalidation message
> > to all the backends such that related backends could act on the
> > invalidation and throw an error. This is to cover the use-case you
> > described on [1].

> Indeed, your approach covers the use case I described, but I'm not sure whether it is really a good one.
> In your approach, a single background worker process would manage all foreign servers.
> It may be OK if there are only a few servers, but if there are hundreds of servers,
> the interval between checks will become longer.

I expect users will typically have many more backends than foreign servers. We could have a threshold for spinning up a new bg worker (e.g., every 10 servers get a new bg worker). Still, I think that'd be an optimization that is probably not necessary for the majority of users.
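
To make the threshold idea concrete, here is a minimal sketch of the arithmetic, assuming a fixed number of servers per worker. The function name and parameters are hypothetical and not part of any patch on this thread:

```c
#include <assert.h>

/*
 * Hypothetical helper: how many health-check workers are needed if each
 * worker handles at most servers_per_worker foreign servers.
 * This is a ceiling division; 0 servers need 0 workers.
 */
static int
health_workers_needed(int n_servers, int servers_per_worker)
{
    assert(n_servers >= 0 && servers_per_worker > 0);
    return (n_servers + servers_per_worker - 1) / servers_per_worker;
}
```

With 10 servers per worker, up to 10 servers would share one worker and the 11th server would trigger a second one.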
 
> Currently, each FDW can decide per backend whether to do health checks or not.
> For example, we can skip health checks if the foreign server is not currently in use.
> A background worker cannot exercise that kind of control.
> Based on the above, I do not agree with introducing a new background worker and making it do the health checks.

Again, my definition of "health check" is probably different. I'd expect the health check to happen continuously, ideally keeping track of how many consecutive times it succeeded and/or when it last failed/succeeded.
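
A rough sketch of the kind of bookkeeping I have in mind follows. The struct, the function names, and the "down after N consecutive failures" rule are all assumptions made for illustration, not existing PostgreSQL code:

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical per-server health record; not an existing PostgreSQL struct. */
typedef struct ServerHealthStats
{
    int         consecutive_successes;
    int         consecutive_failures;
    time_t      last_success;   /* 0 means "never" */
    time_t      last_failure;   /* 0 means "never" */
} ServerHealthStats;

/* Record the outcome of one probe taken at time "now". */
static void
health_stats_record(ServerHealthStats *stats, bool ok, time_t now)
{
    assert(stats != NULL);
    if (ok)
    {
        stats->consecutive_successes++;
        stats->consecutive_failures = 0;
        stats->last_success = now;
    }
    else
    {
        stats->consecutive_failures++;
        stats->consecutive_successes = 0;
        stats->last_failure = now;
    }
}

/* Assumed policy: a server counts as down after "threshold" failures in a row. */
static bool
health_stats_is_down(const ServerHealthStats *stats, int threshold)
{
    assert(stats != NULL);
    return stats->consecutive_failures >= threshold;
}
```

A checker would call health_stats_record() after every probe and act only once health_stats_is_down() returns true, so a single missed probe would not fail anything.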

A transaction failing with a bad error message (or holding some resources locally until the transaction is committed) doesn't sound essential to me. Is there a specific workload you have in mind that would benefit from rolling back a transaction earlier when a remote server dies? Maybe there is, but it is not clear to me, and I haven't seen it discussed on the thread (sorry if I missed it).

I'm trying to understand whether we are trying to solve a problem that does not really exist. I'm bringing this up because I often deal with architectures where there is a local node and remote transactions on different Postgres servers, and I have not encountered many (or any) patterns that would benefit much from this change. In fact, I think, on the contrary, this might add significant overhead for OLTP-type systems with high query throughput.
 
> Moreover, the methods for connecting to foreign servers and checking their health differ per FDW.
> For mysql_fdw [1], we must call mysql_init() and mysql_real_connect().
> For file_fdw, we do not have to connect, but developers may want to calculate a checksum and compare it.
> Therefore, we must provide callback functions anyway.


I think providing callback functions is useful in any case. Each FDW (or, more generally, each extension) should be able to provide its own "health check" function.
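
As a sketch only, a per-FDW callback registry could look roughly like the following. Every name here (HealthCheckCallback, register_health_check, run_health_check, demo_fdw_check) is hypothetical; this is standalone C meant to show the shape of the idea, not an existing or proposed PostgreSQL API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: returns true if the named server looks healthy. */
typedef bool (*HealthCheckCallback) (const char *server_name);

typedef struct HealthCheckEntry
{
    const char *fdw_name;
    HealthCheckCallback check;
} HealthCheckEntry;

#define MAX_HEALTH_CHECKS 16

static HealthCheckEntry health_checks[MAX_HEALTH_CHECKS];
static int  n_health_checks = 0;

/* Register one callback per FDW; returns false if the table is full. */
static bool
register_health_check(const char *fdw_name, HealthCheckCallback check)
{
    assert(fdw_name != NULL && check != NULL);
    if (n_health_checks >= MAX_HEALTH_CHECKS)
        return false;
    health_checks[n_health_checks].fdw_name = fdw_name;
    health_checks[n_health_checks].check = check;
    n_health_checks++;
    return true;
}

/*
 * Run the callback registered for fdw_name and store its result in *healthy.
 * Returns false if that FDW registered no callback.
 */
static bool
run_health_check(const char *fdw_name, const char *server_name, bool *healthy)
{
    assert(healthy != NULL);
    for (int i = 0; i < n_health_checks; i++)
    {
        if (strcmp(health_checks[i].fdw_name, fdw_name) == 0)
        {
            *healthy = health_checks[i].check(server_name);
            return true;
        }
    }
    return false;
}

/* Toy callback: any server whose name starts with "down" is unreachable. */
static bool
demo_fdw_check(const char *server_name)
{
    return strncmp(server_name, "down", 4) != 0;
}
```

Whoever drives the checks (a backend or a bg worker) would only need to call run_health_check(); how to connect and what "healthy" means stays inside each FDW's callback.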
 
Thanks,
Onder KALACI
