[Proposal] Add foreign-server health checks infrastructure - Mailing list pgsql-hackers

From kuroda.hayato@fujitsu.com
Subject [Proposal] Add foreign-server health checks infrastructure
Date
Msg-id TYAPR01MB58662809E678253B90E82CE5F5889@TYAPR01MB5866.jpnprd01.prod.outlook.com
Whole thread Raw
Responses Re: [Proposal] Add foreign-server health checks infrastructure
List pgsql-hackers
Dear Hackers,

I want to propose the feature that checks the health of foreign servers.
As a first step I want to add an infrastructure for periodical checking to PostgreSQL core.
Currently this is the WIP, it does not contain docs.

## Background

Currently there is no way to check the status of an foreign server in PostgreSQL.
If an foreign server's postmaster goes down, or if the network between servers is lost,
the backend process will only detect these when it uses the connection corresponding to that foreign server.

Consider a workload that updates data on an foreign server only at the beginning of a transaction,
and runs a lot of local SQLs. Even if the network is disconnected immediately after accessing the foreign server,
the backend process will continue to execute local SQLs without realizing it.

The process will eventually finish to execute SQLs and try to commit.
Only then will it realize that the foreign server cannot be connect and will abort the transaction.
This situation should be detected as soon as possible
because it is impossible to commit a transaction when the foreign server goes down.
This can be more of a problem if you have system-wide downtime requirements.
That's why I want to implement the health-check feature to postgres.

## Design

In general, PostgreSQL can have a variety of RDBMSs as foreign servers,
so the core cannot support all of them.
Therefore, I propose a method to leave the monitoring of the foreign server to each FDW extensions
and register it as a callback function on the body side.
The attached patch adds this monitoring infrastructure to core.
Within the callback functions, it is expected
that each FDWs will check the state of the connection they hold and call ereport (ERROR)
if it cannot connect to someone.
Of course, you can also have the callback function return false.
There is no particular reason to choose the current method.
Callback functions will be called periodically.

## Implementation

This patch introduces a new timeout and a new GUC parameter. GUC controls the timeout interval.
The timeout takes effect when the callback function is first registered,
before each SQL command is executed, and at the end of the timeout.
This implementation is based on the client_connection_check_interval and other timeouts.

## Further work

As the next step I have a plan to implement the callback function to postgres_fdw.
I already made a PoC, but it deeply depends on the following thread:
https://commitfest.postgresql.org/35/3098/

I also want you to review the postgres_fdw part,
but I think it should not be attached because cfbot cannot understand such a dependency
and will throw build error. Do you know how to deal with them in this case?

Your comments and suggestions are very welcome.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: Skipping logical replication transactions on subscriber side
Next
From: Pavel Stehule
Date:
Subject: Re: Schema variables - new implementation for Postgres 15