RE: [Proposal] Add foreign-server health checks infrastructure - Mailing list pgsql-hackers

From Hayato Kuroda (Fujitsu)
Subject RE: [Proposal] Add foreign-server health checks infrastructure
Msg-id TYAPR01MB58668728393648C2F7DC7C85F5399@TYAPR01MB5866.jpnprd01.prod.outlook.com
In response to Re: [Proposal] Add foreign-server health checks infrastructure  (Önder Kalacı <onderkalaci@gmail.com>)
Responses Re: [Proposal] Add foreign-server health checks infrastructure
List pgsql-hackers
Dear Önder, all,

Thank you for responding, and sorry for the late reply.

> A transaction failing with a bad error message (or holding some resources
> locally until the transaction is committed) doesn't sound essential to me.
> Is there any specific workload are you referring for optimizing to rollback
> a transaction earlier if a remote server dies?  What kind of workload would
> benefit from that? Maybe there is, but not clear to me and haven't seen
> discussed on the thread (sorry if I missed).

My company and I are worried about overnight batch processing that
includes accesses to foreign servers. If a transaction stays open overnight and
one of the foreign servers crashes during it, the transaction must be rolled back.
But the DBAs may not notice the crash, and the time until the morning
is wasted. This problem may affect the customer's business.
(Checking the status from a different server may not be sufficient.
DBAs must also check the network between the databases, and that is easy to overlook.)
This is our motivation.

> I'm trying to understand if we are trying to solve a problem that does not
> really exists. I'm bringing this up, because I often deal with
> architectures where there is a local node and remote transaction on
> different Postgres servers. And, I have not encountered many (or any)
> pattern that'd benefit from this change much. In fact, I think, on the
> contrary, this might add significant overhead for OLTP type of high query
> throughput systems.

As I said above, I did not consider OLTP systems. But I agree that the current
callback mechanism may have significant overhead.

Actually, we may not be able to decide on the correct way to detect the failure now.
Your suggestion, that the checks should be done by a BGworker and that we record stats about them,
seems efficient and may be smarter, but it may not match my motivation now.
My approach may have large overhead and may not be usable for OLTP systems.


So how about implementing the check function as an SQL function first and improving it incrementally?
This still satisfies our motivation, and it avoids the overhead because we can reduce the number of calls.
If we decide to establish a new connection in the checking function, we can refactor it.
And if we decide to introduce a health-check BGworker, then we can add a process that calls the implemented function
periodically.

PSA a patch set that implements this as an SQL function. I moved the checking function to the libpq layer, fe-misc.c.
Note that poll() is used here, which means that currently this function can be used only on some limited platforms.
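
To make the idea concrete, here is a minimal sketch of a non-blocking
socket check built on poll(). It is only an illustration, not the code in
the patch: the helper name socket_peer_closed() is hypothetical, and the
use of a zero timeout plus recv() with MSG_PEEK is an assumption about one
possible way to detect a closed peer. The attached patch may test
different events.

```c
#include <poll.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper for illustration; not the function from the patch.
 *
 * Returns 1 if the peer appears to have closed the connection,
 * 0 if no close is visible, and -1 if the check itself failed.
 */
static int
socket_peer_closed(int sock)
{
    struct pollfd pfd;
    char        c;
    ssize_t     n;

    pfd.fd = sock;
    pfd.events = POLLIN;
    pfd.revents = 0;

    /* Timeout 0: report the current state without blocking. */
    if (poll(&pfd, 1, 0) < 0)
        return -1;

    if (pfd.revents & POLLNVAL)
        return -1;              /* not an open descriptor */
    if (pfd.revents & (POLLHUP | POLLERR))
        return 1;               /* hang-up or socket error */

    if (pfd.revents & POLLIN)
    {
        /* Peek so that pending data stays in the socket buffer. */
        n = recv(sock, &c, 1, MSG_PEEK | MSG_DONTWAIT);
        if (n == 0)
            return 1;           /* EOF: peer closed its end */
        if (n < 0)
            return -1;
    }
    return 0;
}
```

For a libpq connection, the descriptor passed in would be the one returned
by PQsocket(conn). A check like this only sees what the local kernel
already knows, so a silently dropped network path may stay undetected
until some traffic is sent.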

I have added a parameter check_all that controls the scope of the servers to be checked,
but this is not related to my motivation, so we can remove it if it is not needed.

(I have not implemented another version that uses epoll() or kqueue(),
because they do not seem to be called in the libpq layer. Do you know of any reason for that?)

What do you think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

