Re: [Proposal] Add foreign-server health checks infrastructure - Mailing list pgsql-hackers

From Önder Kalacı
Subject Re: [Proposal] Add foreign-server health checks infrastructure
Date
Msg-id CACawEhW56PHQ83Q59x4U5zpi0rVJ6=0Vn-FeYSLZ13Y0yasebQ@mail.gmail.com
Whole thread Raw
In response to RE: [Proposal] Add foreign-server health checks infrastructure  ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
Responses RE: [Proposal] Add foreign-server health checks infrastructure
List pgsql-hackers
Hi Hayota Kuroda,


I (and my company) worried about overnight batch processing that
contains some accesses to foreign servers. If the transaction is opened overnight and
one of foreign servers is crashed during it, the transaction must be rollbacked.
But there is a possibility that DBAs do not recognize the crash and
they waste a time until the morning. This problem may affect customer's business.
(It may not be sufficient to check the status from another different server.
DBAs must check the network between the databases, and they may be oversight.)
This is a motivation we thought.

Thanks for the clarification, I agree that this might be a valid concern for systems.
 

So how about implementing a check function as an SQL function once and update incrementally?

I think a SQL function makes sense. 
 
This still satisfy our motivation and it can avoid overhead because we can reduce the number of calling it.

Yes, it makes sense. Also, it allows other extensions to utilize the new libpq function, which is neat.
 
If we decide that we establish a new connection in the checking function, we can refactor the it.
And if we decide that we introduce health-check BGworker, then we can add a process that calls implemented function periodically.

Right, your approach doesn't conflict with a more sophisticated approach, in fact is a useful building block.
 

Note that poll() is used here, it means that currently this function can be used on some limited platforms.


I think more experienced hackers could guide us here. I don't see a problem with that, but checking other functions like pqSocketPoll(), I see that Postgres uses the select system call. I wonder if we can have something similar with select? Seems no, but wanted to bring up in case you know more about that?
 
I have added a parameter check_all that controls the scope of to-be-checked servers,
But this is not related with my motivation so we can remove if not needed.

I guess it kind of makes sense to have the flexibility to check all connections vs only in tx connections.

Though, maybe we should follow a similar approach to postgres_fdw_disconnect(servername) / postgres_fdw_disconnect_all()

    postgres_fdw_verify_connection_states(servername) / postgres_fdw_verify_connection_states_all()

That seems like a more natural API considering the other UDFs. You can also use in combination with postgres_fdw_get_connections()


(I have not implemented another version that uses epoll() or kqueue(),
because they seem to be not called on the libpq layer. Do you know any reasons?)

 
Hmm, I don't know the reason. Is that maybe epoll is available on a smaller number of platforms and libpq can be used on more platforms as being a client library?

Now, some comments regarding the v18:

+static int
+pqConncheck_internal(int sock)
+{
+#if (defined(HAVE_POLL) && defined(POLLRDHUP))
+ struct pollfd input_fd;
+
+ input_fd.fd = sock;
+ input_fd.events = POLLRDHUP;
+ input_fd.revents = 0;
+
+ poll(&input_fd, 1, 0);
+
+ return input_fd.revents & POLLRDHUP;

WaitEventSetWaitBlock's defined(WAIT_USE_POLL) branch uses the following check to find WL_SOCKET_CLOSED

#ifdef POLLRDHUP
if ((cur_event->events & WL_SOCKET_CLOSED) &&
(cur_pollfd->revents & (POLLRDHUP | errflags)))
{
/* remote peer closed, or error */
occurred_events->events |= WL_SOCKET_CLOSED;
}
#endif

Where errflags = POLLHUP | POLLERR | POLLNVAL;

So, should we also be using all these flags like: input_fd.events = POLLRDHUP | errflags;  ?


+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("could not connect to server \"%s\"",
+ server->servername),
+ errdetail("Socket close is detected."),
+ errhint("Please check the health of the server.")));

Is erroring out always necessary? Maybe we should just return true/false, and let the caller decide what to do? For example, you might want to rollback to savepoint and retry?  If the caller wants to rollback the whole transaction, that is possible anyway.

Maybe similar to postgres_fdw_disconnect(), we can warn if the connection is involved in a tx (e.g., depth > 0)? 

new file mode 100644
index 0000000000..3ce169c837
--- /dev/null
+++ b/contrib/postgres_fdw/expected/postgres_fdw_1.out
@@ -0,0 +1,11615 @@
+-- ===================================================================
+-- create FDW objects
+-- ===================================================================
+CREATE EXTENSION postgres_fdw;
+CREATE SERVER testserver1 FOREIGN DATA WRAPPER postgres_fdw;
+DO $d$
+    BEGIN

Do we really need this new file or just an oversight in your patch?

Thanks,
Onder KALACI

Software Engineer at Microsoft &

Developing the Citus database extension for PostgreSQL 

pgsql-hackers by date:

Previous
From: John Naylor
Date:
Subject: Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
Next
From: Konstantin Knizhnik
Date:
Subject: Re: Lack of PageSetLSN in heap_xlog_visible