Thread: [Proposal] Add foreign-server health checks infrastructure

[Proposal] Add foreign-server health checks infrastructure

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Hackers,

I want to propose the feature that checks the health of foreign servers.
As a first step I want to add an infrastructure for periodical checking to PostgreSQL core.
Currently this is the WIP, it does not contain docs.

## Background

Currently there is no way to check the status of an foreign server in PostgreSQL.
If an foreign server's postmaster goes down, or if the network between servers is lost,
the backend process will only detect these when it uses the connection corresponding to that foreign server.

Consider a workload that updates data on an foreign server only at the beginning of a transaction,
and runs a lot of local SQLs. Even if the network is disconnected immediately after accessing the foreign server,
the backend process will continue to execute local SQLs without realizing it.

The process will eventually finish to execute SQLs and try to commit.
Only then will it realize that the foreign server cannot be connect and will abort the transaction.
This situation should be detected as soon as possible
because it is impossible to commit a transaction when the foreign server goes down.
This can be more of a problem if you have system-wide downtime requirements.
That's why I want to implement the health-check feature to postgres.

## Design

In general, PostgreSQL can have a variety of RDBMSs as foreign servers,
so the core cannot support all of them.
Therefore, I propose a method to leave the monitoring of the foreign server to each FDW extensions
and register it as a callback function on the body side.
The attached patch adds this monitoring infrastructure to core.
Within the callback functions, it is expected
that each FDWs will check the state of the connection they hold and call ereport (ERROR)
if it cannot connect to someone.
Of course, you can also have the callback function return false.
There is no particular reason to choose the current method.
Callback functions will be called periodically.

## Implementation

This patch introduces a new timeout and a new GUC parameter. GUC controls the timeout interval.
The timeout takes effect when the callback function is first registered,
before each SQL command is executed, and at the end of the timeout.
This implementation is based on the client_connection_check_interval and other timeouts.

## Further work

As the next step I have a plan to implement the callback function to postgres_fdw.
I already made a PoC, but it deeply depends on the following thread:
https://commitfest.postgresql.org/35/3098/

I also want you to review the postgres_fdw part,
but I think it should not be attached because cfbot cannot understand such a dependency
and will throw build error. Do you know how to deal with them in this case?

Your comments and suggestions are very welcome.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
Shinya Kato
Date:
Hi,
Thank you for the patch!

On 2021-10-30 12:50, kuroda.hayato@fujitsu.com wrote:

> ## Background
> 
> Currently there is no way to check the status of an foreign server in
> PostgreSQL.
> If an foreign server's postmaster goes down, or if the network between
> servers is lost,
> the backend process will only detect these when it uses the connection
> corresponding to that foreign server.
> 
> Consider a workload that updates data on an foreign server only at the
> beginning of a transaction,
> and runs a lot of local SQLs. Even if the network is disconnected
> immediately after accessing the foreign server,
> the backend process will continue to execute local SQLs without 
> realizing it.
> 
> The process will eventually finish to execute SQLs and try to commit.
> Only then will it realize that the foreign server cannot be connect
> and will abort the transaction.
> This situation should be detected as soon as possible
> because it is impossible to commit a transaction when the foreign
> server goes down.
> This can be more of a problem if you have system-wide downtime 
> requirements.
> That's why I want to implement the health-check feature to postgres.

It's a good idea. I also think such a situation should be avoided.

> ## Further work
> 
> As the next step I have a plan to implement the callback function to
> postgres_fdw.
> I already made a PoC, but it deeply depends on the following thread:
> https://commitfest.postgresql.org/35/3098/
> 
> I also want you to review the postgres_fdw part,
> but I think it should not be attached because cfbot cannot understand
> such a dependency
> and will throw build error. Do you know how to deal with them in this 
> case?

I don't know how to deal with them, but I hope you will attach the PoC, 
as it may be easier to review.

-- 
Regards,

--
Shinya Kato
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: [Proposal] Add foreign-server health checks infrastructure

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Kato-san,

Thank you for your interest!

> > I also want you to review the postgres_fdw part,
> > but I think it should not be attached because cfbot cannot understand
> > such a dependency
> > and will throw build error. Do you know how to deal with them in this
> > case?
>
> I don't know how to deal with them, but I hope you will attach the PoC,
> as it may be easier to review.

OK, I attached the PoC along with the dependent patches. Please see the zip file.
add_helth_check_... patch is written by me, and other two patches are
just copied from [1].
In the new callback function ConnectionHash is searched sequentially and
WaitEventSetWait() is performed for WL_SOCKET_CLOSED socket event.
This event is added by the dependent ones.

===
How to use
===

I'll explain how to use it briefly.

1. boot two postmaster processes. One is coordinator, and another is worker
2. set remote_servers_connection_check_interval to non-zero value at the coordinator
3. create tables to worker DB-cluster.
4. create foreign server, user mapping, and foreign table to coordinator.
5. connect to coordinator via psql.
6. open a transaction and access to foreing tables.
7. do "pg_ctl stop" command to woker DB-cluser.
8. execute some commands that does not access an foreign table.
9. Finally the following output will be get:

ERROR:  Postgres foreign server XXX might be down.

===
Example in some steps
===

3. at worker

```
postgres=# \d
        List of relations
 Schema |  Name  | Type  | Owner
--------+--------+-------+--------
 public | remote | table | hayato
(1 row)
```

4. at coordinator

```
postgres=# select * from pg_foreign_server ;
  oid  | srvname | srvowner | srvfdw | srvtype | srvversion | srvacl |         srvoptions
-------+---------+----------+--------+---------+------------+--------+-----------------------------
 16406 | remote  |       10 |  16402 |         |            |        | {port=5433,dbname=postgres}
(1 row)

postgres=# select * from pg_user_mapping ;
  oid  | umuser | umserver |   umoptions
-------+--------+----------+---------------
 16407 |     10 |    16406 | {user=hayato}
(1 row)

postgres=# \d
            List of relations
 Schema |  Name  |     Type      | Owner
--------+--------+---------------+--------
 public | local  | table         | hayato
 public | remote | foreign table | hayato
(2 rows)
```

6-9. at coordinator

```
postgres=# begin;
BEGIN
postgres=*# select * from remote ;
 id
----
  1
(1 row)

postgres=*# select * from local ;
ERROR:  Postgres foreign server remote might be down.
postgres=!#
```

Note that some keepalive settings are needed
if you want to detect cable breakdown events.
In my understanding following parameters are needed as server options:

* keepalives_idle
* keepalives_count
* keepalives_interval

[1]: https://commitfest.postgresql.org/35/3098/

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
Shinya Kato
Date:
On 2021-11-18 21:43, kuroda.hayato@fujitsu.com wrote:
> Dear Kato-san,
> 
> Thank you for your interest!
> 
>> > I also want you to review the postgres_fdw part,
>> > but I think it should not be attached because cfbot cannot understand
>> > such a dependency
>> > and will throw build error. Do you know how to deal with them in this
>> > case?
>> 
>> I don't know how to deal with them, but I hope you will attach the 
>> PoC,
>> as it may be easier to review.
> 
> OK, I attached the PoC along with the dependent patches. Please see
> the zip file.
> add_helth_check_... patch is written by me, and other two patches are
> just copied from [1].
> In the new callback function ConnectionHash is searched sequentially 
> and
> WaitEventSetWait() is performed for WL_SOCKET_CLOSED socket event.
> This event is added by the dependent ones.

Thank you for sending the patches!
I confirmed that they can be compiled and tested successfully on CentOS 
8.

I haven't looked at the core of the code yet, but I took a quick look at 
it.

+    {
+        {"remote_servers_connection_check_interval", PGC_USERSET, 
CONN_AUTH_SETTINGS,
+            gettext_noop("Sets the time interval between checks for 
disconnection of remote servers."),
+            NULL,
+            GUC_UNIT_MS
+        },
+        &remote_servers_connection_check_interval,
+        0, 0, INT_MAX,
+    },

If you don't use check_hook, assign_hook and show_hook, you should 
explicitly write "NULL, NULL, NULL", as show below.
+    {
+        {"remote_servers_connection_check_interval", PGC_USERSET, 
CONN_AUTH_SETTINGS,
+            gettext_noop("Sets the time interval between checks for 
disconnection of remote servers."),
+            NULL,
+            GUC_UNIT_MS
+        },
+        &remote_servers_connection_check_interval,
+        0, 0, INT_MAX,
+        NULL, NULL, NULL
+    },


+            ereport(ERROR,
+                    errcode(ERRCODE_CONNECTION_FAILURE),
+                    errmsg("Postgres foreign server %s might be down.",
+                            server->servername));

According to [1], error messages should start with a lowercase letter 
and not use a period.
Also, along with the rest of the code, it is a good idea to enclose the 
server name in double quotes.

I'll get back to you once I've read all the code.

[1] https://www.postgresql.org/docs/devel/error-style-guide.html

-- 
Regards,

--
Shinya Kato
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: [Proposal] Add foreign-server health checks infrastructure

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Kato-san,

Thank you for reviewing!

> Thank you for sending the patches!
> I confirmed that they can be compiled and tested successfully on CentOS
> 8.

Thanks!

> +    {
> +        {"remote_servers_connection_check_interval", PGC_USERSET,
> CONN_AUTH_SETTINGS,
> +            gettext_noop("Sets the time interval between checks
> for
> disconnection of remote servers."),
> +            NULL,
> +            GUC_UNIT_MS
> +        },
> +        &remote_servers_connection_check_interval,
> +        0, 0, INT_MAX,
> +    },
>
> If you don't use check_hook, assign_hook and show_hook, you should
> explicitly write "NULL, NULL, NULL", as show below.

Yeah I forgot the line. Fixed.

> +            ereport(ERROR,
> +
>     errcode(ERRCODE_CONNECTION_FAILURE),
> +                    errmsg("Postgres foreign server %s
> might be down.",
> +
>     server->servername));
>
> According to [1], error messages should start with a lowercase letter
> and not use a period.
> Also, along with the rest of the code, it is a good idea to enclose the
> server name in double quotes.

I confirmed the postgres error-reporting policy and fixed to follow that.
How do you think?

Attached are the latest version.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

RE: [Proposal] Add foreign-server health checks infrastructure

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Kato-san,

I found the missing space, and I added a test.
I'm very happy if you review this.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
Zhihong Yu
Date:


On Tue, Nov 23, 2021 at 8:57 PM kuroda.hayato@fujitsu.com <kuroda.hayato@fujitsu.com> wrote:
Dear Kato-san,

Thank you for reviewing!

> Thank you for sending the patches!
> I confirmed that they can be compiled and tested successfully on CentOS
> 8.

Thanks!

> +     {
> +             {"remote_servers_connection_check_interval", PGC_USERSET,
> CONN_AUTH_SETTINGS,
> +                     gettext_noop("Sets the time interval between checks
> for
> disconnection of remote servers."),
> +                     NULL,
> +                     GUC_UNIT_MS
> +             },
> +             &remote_servers_connection_check_interval,
> +             0, 0, INT_MAX,
> +     },
>
> If you don't use check_hook, assign_hook and show_hook, you should
> explicitly write "NULL, NULL, NULL", as show below.

Yeah I forgot the line. Fixed.

> +                     ereport(ERROR,
> +
>       errcode(ERRCODE_CONNECTION_FAILURE),
> +                                     errmsg("Postgres foreign server %s
> might be down.",
> +
>       server->servername));
>
> According to [1], error messages should start with a lowercase letter
> and not use a period.
> Also, along with the rest of the code, it is a good idea to enclose the
> server name in double quotes.

I confirmed the postgres error-reporting policy and fixed to follow that.
How do you think?

Attached are the latest version.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Hi,

+#define HOLD_CHECKING_REMOTE_SERVERS_INTERRUPTS()  (CheckingRemoteServersHoldoffCount++)

The macro contains only one operation. Can the macro be removed (with `CheckingRemoteServersHoldoffCount++` inlined) ?

+   if (CheckingRemoteServersTimeoutPending && CheckingRemoteServersHoldoffCount != 0)
+   {
+       /*
+        * Skip checking foreign servers while reading messages.
+        */
+       InterruptPending = true;
+   }
+   else if (CheckingRemoteServersTimeoutPending)

Would the code be more readable if the above two if blocks be moved inside one enclosing if block (factoring the common condition)?

+   if (CheckingRemoteServersTimeoutPending)

Cheers

RE: [Proposal] Add foreign-server health checks infrastructure

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Zhihong,

Thank you for giving comments! I'll post new patches later.

> +#define HOLD_CHECKING_REMOTE_SERVERS_INTERRUPTS()  (CheckingRemoteServersHoldoffCount++)
>
> The macro contains only one operation. Can the macro be removed (with `CheckingRemoteServersHoldoffCount++` inlined)
?

Hmm, these HOLD/RESUME macros are followed HOLD_INTERRUPUST() and HOLD_CANCEL_INTERRUPTS():

```
#define HOLD_INTERRUPTS()  (InterruptHoldoffCount++)

#define RESUME_INTERRUPTS() \
do { \
        Assert(InterruptHoldoffCount > 0); \
        InterruptHoldoffCount--; \
} while(0)

#define HOLD_CANCEL_INTERRUPTS()  (QueryCancelHoldoffCount++)

#define RESUME_CANCEL_INTERRUPTS() \
do { \
        Assert(QueryCancelHoldoffCount > 0); \
        QueryCancelHoldoffCount--; \
} while(0)

#define START_CRIT_SECTION()  (CritSectionCount++)

#define END_CRIT_SECTION() \
do { \
        Assert(CritSectionCount > 0); \
        CritSectionCount--; \
} while(0)
```

So I want to keep the current style. Could you tell me if you have any other reasons?

> +   if (CheckingRemoteServersTimeoutPending && CheckingRemoteServersHoldoffCount != 0)
> +   {
> +       /*
> +        * Skip checking foreign servers while reading messages.
> +        */
> +       InterruptPending = true;
> +   }
> +   else if (CheckingRemoteServersTimeoutPending)
>
> Would the code be more readable if the above two if blocks be moved inside one enclosing if block (factoring the
commoncondition)? 
>
> +   if (CheckingRemoteServersTimeoutPending)

+1. Will fix.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



RE: [Proposal] Add foreign-server health checks infrastructure

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Hackers,

I attached new version that is based from Zhihong's comments.
And I newly attached a doc patch. I think the infrastructure part is no longer WIP.

Could you review them?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
Zhihong Yu
Date:


On Mon, Nov 29, 2021 at 12:51 AM kuroda.hayato@fujitsu.com <kuroda.hayato@fujitsu.com> wrote:
Dear Zhihong,

Thank you for giving comments! I'll post new patches later.

> +#define HOLD_CHECKING_REMOTE_SERVERS_INTERRUPTS()  (CheckingRemoteServersHoldoffCount++)
>
> The macro contains only one operation. Can the macro be removed (with `CheckingRemoteServersHoldoffCount++` inlined) ?

Hmm, these HOLD/RESUME macros are followed HOLD_INTERRUPUST() and HOLD_CANCEL_INTERRUPTS():

```
#define HOLD_INTERRUPTS()  (InterruptHoldoffCount++)

#define RESUME_INTERRUPTS() \
do { \
        Assert(InterruptHoldoffCount > 0); \
        InterruptHoldoffCount--; \
} while(0)

#define HOLD_CANCEL_INTERRUPTS()  (QueryCancelHoldoffCount++)

#define RESUME_CANCEL_INTERRUPTS() \
do { \
        Assert(QueryCancelHoldoffCount > 0); \
        QueryCancelHoldoffCount--; \
} while(0)

#define START_CRIT_SECTION()  (CritSectionCount++)

#define END_CRIT_SECTION() \
do { \
        Assert(CritSectionCount > 0); \
        CritSectionCount--; \
} while(0)
```

So I want to keep the current style. Could you tell me if you have any other reasons?

> +   if (CheckingRemoteServersTimeoutPending && CheckingRemoteServersHoldoffCount != 0)
> +   {
> +       /*
> +        * Skip checking foreign servers while reading messages.
> +        */
> +       InterruptPending = true;
> +   }
> +   else if (CheckingRemoteServersTimeoutPending)
>
> Would the code be more readable if the above two if blocks be moved inside one enclosing if block (factoring the common condition)?
>
> +   if (CheckingRemoteServersTimeoutPending)

+1. Will fix.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Hi,

It is Okay to keep the macros.

Thanks

Re: [Proposal] Add foreign-server health checks infrastructure

From
Shinya Kato
Date:
On 2021-11-29 21:36, Zhihong Yu wrote:
> On Mon, Nov 29, 2021 at 12:51 AM kuroda.hayato@fujitsu.com
> <kuroda.hayato@fujitsu.com> wrote:
> 
>> Dear Zhihong,
>> 
>> Thank you for giving comments! I'll post new patches later.
>> 
>>> +#define HOLD_CHECKING_REMOTE_SERVERS_INTERRUPTS()
>> (CheckingRemoteServersHoldoffCount++)
>>> 
>>> The macro contains only one operation. Can the macro be removed
>> (with `CheckingRemoteServersHoldoffCount++` inlined) ?
>> 
>> Hmm, these HOLD/RESUME macros are followed HOLD_INTERRUPUST() and
>> HOLD_CANCEL_INTERRUPTS():
>> 
>> ```
>> #define HOLD_INTERRUPTS()  (InterruptHoldoffCount++)
>> 
>> #define RESUME_INTERRUPTS() \
>> do { \
>> Assert(InterruptHoldoffCount > 0); \
>> InterruptHoldoffCount--; \
>> } while(0)
>> 
>> #define HOLD_CANCEL_INTERRUPTS()  (QueryCancelHoldoffCount++)
>> 
>> #define RESUME_CANCEL_INTERRUPTS() \
>> do { \
>> Assert(QueryCancelHoldoffCount > 0); \
>> QueryCancelHoldoffCount--; \
>> } while(0)
>> 
>> #define START_CRIT_SECTION()  (CritSectionCount++)
>> 
>> #define END_CRIT_SECTION() \
>> do { \
>> Assert(CritSectionCount > 0); \
>> CritSectionCount--; \
>> } while(0)
>> ```
>> 
>> So I want to keep the current style. Could you tell me if you have
>> any other reasons?
>> 
>>> +   if (CheckingRemoteServersTimeoutPending &&
>> CheckingRemoteServersHoldoffCount != 0)
>>> +   {
>>> +       /*
>>> +        * Skip checking foreign servers while reading messages.
>>> +        */
>>> +       InterruptPending = true;
>>> +   }
>>> +   else if (CheckingRemoteServersTimeoutPending)
>>> 
>>> Would the code be more readable if the above two if blocks be
>> moved inside one enclosing if block (factoring the common
>> condition)?
>>> 
>>> +   if (CheckingRemoteServersTimeoutPending)
>> 
>> +1. Will fix.
>> 
>> Best Regards,
>> Hayato Kuroda
>> FUJITSU LIMITED
> 
> Hi,
> 
> It is Okay to keep the macros.
> 
> Thanks

Hi, Kuroda-san. Sorry for late reply.

Even for local-only transaction, I thought it useless to execute 
CallCheckingRemoteServersCallbacks() and enable_timeout_after(). Can I 
make it so that it determines outside whether it contains SQL to the 
remote or not?

The following points are minor.
1. In foreign.c, some of the lines are too long and should be broken.
2. In pqcomm.c, the newline have been removed, what is the intention of 
this?
3. In postgres.c,
3-1. how about inserting a comment between lines 2713 and 2714, similar 
to line 2707?
3-2. the line breaks in line 3242 and line 3243 should be aligned.
3-3. you should change
            /*
            * Skip checking foreign servers while reading messages.
            */
to
            /*
             * Skip checking foreign servers while reading messages.
             */
4. In connection.c, There is a typo in line 1684, so "fucntion" should 
be changed to "function".


-- 
Regards,

--
Shinya Kato
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: [Proposal] Add foreign-server health checks infrastructure

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Kato-san,

Thank you for giving comments! And sorry for late reply.
I rebased my patches.

> Even for local-only transaction, I thought it useless to execute
> CallCheckingRemoteServersCallbacks() and enable_timeout_after(). Can I
> make it so that it determines outside whether it contains SQL to the
> remote or not?

Yeah, remote-checking timeout will be enable even ifa local transaction is opened.
In my understanding, postgres cannot distinguish whether opening transactions
are using only local object or not.
My first idea was that turning on the timeout when GetFdwRoutineXXX functions
were called, but in some cases FDWs may not use remote connection even if
they call such a handler function. e.g. postgresExplainForeignScan().
Hence I tried another approach that unregister all checking callback
the end of each transactions. Only FDWs can notice that transactions are remote or local,
so they must register callback when they really connect to other database.
This implementation will avoid calling remote checking in the case of local transaction,
but multiple registering and unregistering may lead another overhead.
I attached which implements that.

> The following points are minor.
> 1. In foreign.c, some of the lines are too long and should be broken.
> 2. In pqcomm.c, the newline have been removed, what is the intention of
> this?
> 3. In postgres.c,
> 3-1. how about inserting a comment between lines 2713 and 2714, similar
> to line 2707?
> 3-2. the line breaks in line 3242 and line 3243 should be aligned.
> 3-3. you should change
>             /*
>             * Skip checking foreign servers while reading
> messages.
>             */
> to
>             /*
>              * Skip checking foreign servers while reading
> messages.
>              */
> 4. In connection.c, There is a typo in line 1684, so "fucntion" should
> be changed to "function".

Maybe all of them were fixed. Thanks!

How do you think?


Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
Shinya Kato
Date:
Thank you for the new patch!

On 2021-12-15 15:40, kuroda.hayato@fujitsu.com wrote:
> Dear Kato-san,
> 
> Thank you for giving comments! And sorry for late reply.
> I rebased my patches.
> 
>> Even for local-only transaction, I thought it useless to execute
>> CallCheckingRemoteServersCallbacks() and enable_timeout_after(). Can I
>> make it so that it determines outside whether it contains SQL to the
>> remote or not?
> 
> Yeah, remote-checking timeout will be enable even ifa local
> transaction is opened.
> In my understanding, postgres cannot distinguish whether opening 
> transactions
> are using only local object or not.
> My first idea was that turning on the timeout when GetFdwRoutineXXX 
> functions
> were called, but in some cases FDWs may not use remote connection even 
> if
> they call such a handler function. e.g. postgresExplainForeignScan().
> Hence I tried another approach that unregister all checking callback
> the end of each transactions. Only FDWs can notice that transactions
> are remote or local,
> so they must register callback when they really connect to other 
> database.
> This implementation will avoid calling remote checking in the case of
> local transaction,
> but multiple registering and unregistering may lead another overhead.
> I attached which implements that.
> 
It certainly incurs another overhead, but I think it's better than the 
previous one.
So far, I haven't encountered any problems, but I'd like other people's 
opinions.

>> The following points are minor.
>> 1. In foreign.c, some of the lines are too long and should be broken.
>> 2. In pqcomm.c, the newline have been removed, what is the intention 
>> of
>> this?
>> 3. In postgres.c,
>> 3-1. how about inserting a comment between lines 2713 and 2714, 
>> similar
>> to line 2707?
>> 3-2. the line breaks in line 3242 and line 3243 should be aligned.
>> 3-3. you should change
>>             /*
>>             * Skip checking foreign servers while reading
>> messages.
>>             */
>> to
>>             /*
>>              * Skip checking foreign servers while reading
>> messages.
>>              */
>> 4. In connection.c, There is a typo in line 1684, so "fucntion" should
>> be changed to "function".
> 
> Maybe all of them were fixed. Thanks!
Thank you, and it looks good to me.


-- 
Regards,

--
Shinya Kato
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: [Proposal] Add foreign-server health checks infrastructure

From
Zhihong Yu
Date:


On Tue, Jan 4, 2022 at 10:21 PM Shinya Kato <Shinya11.Kato@oss.nttdata.com> wrote:
Thank you for the new patch!

On 2021-12-15 15:40, kuroda.hayato@fujitsu.com wrote:
> Dear Kato-san,
>
> Thank you for giving comments! And sorry for late reply.
> I rebased my patches.
>
>> Even for local-only transaction, I thought it useless to execute
>> CallCheckingRemoteServersCallbacks() and enable_timeout_after(). Can I
>> make it so that it determines outside whether it contains SQL to the
>> remote or not?
>
> Yeah, remote-checking timeout will be enable even ifa local
> transaction is opened.
> In my understanding, postgres cannot distinguish whether opening
> transactions
> are using only local object or not.
> My first idea was that turning on the timeout when GetFdwRoutineXXX
> functions
> were called, but in some cases FDWs may not use remote connection even
> if
> they call such a handler function. e.g. postgresExplainForeignScan().
> Hence I tried another approach that unregister all checking callback
> the end of each transactions. Only FDWs can notice that transactions
> are remote or local,
> so they must register callback when they really connect to other
> database.
> This implementation will avoid calling remote checking in the case of
> local transaction,
> but multiple registering and unregistering may lead another overhead.
> I attached which implements that.
>
It certainly incurs another overhead, but I think it's better than the
previous one.
So far, I haven't encountered any problems, but I'd like other people's
opinions.

>> The following points are minor.
>> 1. In foreign.c, some of the lines are too long and should be broken.
>> 2. In pqcomm.c, the newline have been removed, what is the intention
>> of
>> this?
>> 3. In postgres.c,
>> 3-1. how about inserting a comment between lines 2713 and 2714,
>> similar
>> to line 2707?
>> 3-2. the line breaks in line 3242 and line 3243 should be aligned.
>> 3-3. you should change
>>                      /*
>>                      * Skip checking foreign servers while reading
>> messages.
>>                      */
>> to
>>                      /*
>>                       * Skip checking foreign servers while reading
>> messages.
>>                       */
>> 4. In connection.c, There is a typo in line 1684, so "fucntion" should
>> be changed to "function".
>
> Maybe all of them were fixed. Thanks!
Thank you, and it looks good to me.


Hi,
+       UnregisterAllCheckingRemoteServersCallback();

UnregisterAllCheckingRemoteServersCallback -> UnregisterAllCheckingRemoteServersCallbacks

+   CheckingRemoteServersCallbackItem *item;
+   item = fdw_callbacks;

The above two lines can be combined.

+UnregisterCheckingRemoteServersCallback(CheckingRemoteServersCallback callback,
+                                       void *arg)

Is the above func called anywhere ?

+       if (item->callback == callback && item->arg == arg)

Is comparing argument pointers stable ? What if the same argument is cloned ?

+           CallCheckingRemoteServersCallbacks();
+
+           if (remote_servers_connection_check_interval > 0)
+               enable_timeout_after(CHECKING_REMOTE_SERVERS_TIMEOUT,
+                                    remote_servers_connection_check_interval);

Should the call to CallCheckingRemoteServersCallbacks() be placed under the if block checking remote_servers_connection_check_interval ?
If remote_servers_connection_check_interval is 0, it seems the callbacks shouldn't be called.

Cheers

Re: [Proposal] Add foreign-server health checks infrastructure

From
Fujii Masao
Date:

On 2021/12/15 15:40, kuroda.hayato@fujitsu.com wrote:
> Yeah, remote-checking timeout will be enable even ifa local transaction is opened.
> In my understanding, postgres cannot distinguish whether opening transactions
> are using only local object or not.
> My first idea was that turning on the timeout when GetFdwRoutineXXX functions
> were called,

What about starting the timeout in GetConnection(), instead?

> I attached which implements that.

v05_0004_add_tests.patch failed to be applied to the master. Could you rebase it?


-        This option is currently available only on systems that support the
-        non-standard <symbol>POLLRDHUP</symbol> extension to the
-        <symbol>poll</symbol> system call, including Linux.
+        This option relies on kernel events exposed by Linux, macOS, illumos
+        and the BSD family of operating systems, and is not currently available
+        on other systems.

The above change is included in both v5-0003-Use-WL_SOCKET_CLOSED-for-client_connection_check_.patch and
v05_0002_add_doc.patch.If it should be in the former patch, it should be removed from your patch
v05_0002_add_doc.patch.


There seems no user of UnregisterCheckingRemoteServersCallback(). So how about removing it?


Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: [Proposal] Add foreign-server health checks infrastructure

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Zhihong,

Thank you for reviewing! And I have to apologize for the late.
I'll post new patchset later.

> +       UnregisterAllCheckingRemoteServersCallback();
> 
> UnregisterAllCheckingRemoteServersCallback
> -> UnregisterAllCheckingRemoteServersCallbacks

Will fix.

> +   CheckingRemoteServersCallbackItem *item;
> +   item = fdw_callbacks;
> 
> The above two lines can be combined.

Will fix.

> +UnregisterCheckingRemoteServersCallback(CheckingRemoteServersCallback
> callback,
> +                                       void *arg)
> 
> Is the above func called anywhere ?

Currently not, I just kept the API because of any other FDW extensions.
But I cannot find any use cases, so will remove.

> +       if (item->callback == callback && item->arg == arg)
> 
> Is comparing argument pointers stable ? What if the same argument is cloned
> ?

This part is no longer needed. Will remove.

> +           CallCheckingRemoteServersCallbacks();
> +
> +           if (remote_servers_connection_check_interval > 0)
> +               enable_timeout_after(CHECKING_REMOTE_SERVERS_TIMEOUT,
> +
>  remote_servers_connection_check_interval);
>
> Should the call to CallCheckingRemoteServersCallbacks() be placed under the
> if block checking remote_servers_connection_check_interval ?
> If remote_servers_connection_check_interval is 0, it seems the callbacks
> shouldn't be called.

Agreed. We force stopping timeout when the GUC sets to 0 in assign_hook,
so your saying is consistent. Will move.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


RE: [Proposal] Add foreign-server health checks infrastructure

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Fujii-san,

Thank you for your interest! I'll post new version within several days.

> > Yeah, remote-checking timeout will be enable even ifa local transaction is
> opened.
> > In my understanding, postgres cannot distinguish whether opening transactions
> > are using only local object or not.
> > My first idea was that turning on the timeout when GetFdwRoutineXXX
> functions
> > were called,
> 
> What about starting the timeout in GetConnection(), instead?

Did you said about a function in postgres_fdw/connection.c?
In my understanding that means that the timeout should be enabled or disabled
by each FDW extensions.
I did not find bad cases for that, so I'll change like that and make new APIs.

> v05_0004_add_tests.patch failed to be applied to the master. Could you rebase it?

It's caused because a testcase was added in postgres_fdw. Will rebase.

> The above change is included in both
> v5-0003-Use-WL_SOCKET_CLOSED-for-client_connection_check_.patch and
> v05_0002_add_doc.patch. If it should be in the former patch, it should be removed
> from your patch v05_0002_add_doc.patch.

I confused about doc-patch. Sorry for inconvenience.

> There seems no user of UnregisterCheckingRemoteServersCallback(). So how
> about removing it?

Previously I kept the API for any other extensions, but I cannot find use cases.
Will remove.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


RE: [Proposal] Add foreign-server health checks infrastructure

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Fujii-san, Zhihong,

I attached the latest version.
The biggest change is that callbacks are no longer un-registered at the end of transactions.
FDW developer must enable or disable timeout instead, via new APIs.

The timer will be turned on when:
* new GUC is >= 0, and
* anyone calls TryEnableRemoteServerCheckingTimeout().

I think this version is reduced overhead, but it might not be developer friendly...

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
Fujii Masao
Date:

On 2022/01/21 15:08, kuroda.hayato@fujitsu.com wrote:
> Dear Fujii-san, Zhihong,
> 
> I attached the latest version.

Thanks for updating the patch!

+int
+TryDisableRemoteServerCheckingTimeout(void)

When more than one FDWs are used, even if one FDW calls this function to disable the timeout, its
remote-server-check-callbackcan still be called. Is this OK?
 
Please imagine the case where two FDWs are used and they registered their callbacks. Even when one FDW calls
TryDisableRemoteServerCheckingTimeout(),if another FDW has not called that yet, the timeout is still being enabled. If
thetimeout is triggered during that period, even the callback registered by the FDW that has already called
TryDisableRemoteServerCheckingTimeout()would be called.
 


+            if (remote_servers_connection_check_interval > 0)
+            {
+                CallCheckingRemoteServersCallbacks();
+                enable_timeout_after(CHECKING_REMOTE_SERVERS_TIMEOUT,
+                                     remote_servers_connection_check_interval);

LockErrorCleanup() needs to be called before reporting the error, doesn't it?


This can cause an error even while DoingCommandRead == true. Is that safe?


> The biggest change is that callbacks are no longer un-registered at the end of transactions.
> FDW developer must enable or disable timeout instead, via new APIs.
> 
> The timer will be turned on when:
> * new GUC is >= 0, and

This can cause the timeout to be enabled even when no remote transaction is started?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: [Proposal] Add foreign-server health checks infrastructure

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Fujii-san,

Thank you for reviewing! I attached the latest version.

> When more than one FDWs are used, even if one FDW calls this function to
> disable the timeout, its remote-server-check-callback can still be called. Is this
> OK?
> Please imagine the case where two FDWs are used and they registered their
> callbacks. Even when one FDW calls TryDisableRemoteServerCheckingTimeout(),
> if another FDW has not called that yet, the timeout is still being enabled. If the
> timeout is triggered during that period, even the callback registered by the FDW
> that has already called TryDisableRemoteServerCheckingTimeout() would be
> called.

Indeed and it should be avoided. I added a counter to CheckingRemoteServersCallbackItem.
The register function returns the registered item, and it must be set as the argument for
enable and disable functions.
Callback functions will be called when item->counter is larger than zero.

> +            if (remote_servers_connection_check_interval > 0)
> +            {
> +                CallCheckingRemoteServersCallbacks();
> +
>     enable_timeout_after(CHECKING_REMOTE_SERVERS_TIMEOUT,
> +
> remote_servers_connection_check_interval);
> 
> LockErrorCleanup() needs to be called before reporting the error, doesn't it?

You are right. I think this suggests that error-reporting should be done in the core layer.
For meaningful error reporting, I changed a callback specification
that it should return ForeignServer* which points to downed remote server.

> This can cause an error even while DoingCommandRead == true. Is that safe?

I read codes again and I think it is not safe. It is OK when whereToSendOutput is DestRemote,
but not good in InteractiveBackend().

I changed that if-statement for CheckingRemoteServersTimeoutPending is moved just after
ClientConnectionLost, because the that may throw a FATAL error and disconnect from client.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
Fujii Masao
Date:

On 2022/02/01 13:37, kuroda.hayato@fujitsu.com wrote:
> Dear Fujii-san,
> 
> Thank you for reviewing! I attached the latest version.

Thanks!

> Indeed and it should be avoided. I added a counter to CheckingRemoteServersCallbackItem.
> The register function returns the registered item, and it must be set as the argument for
> enable and disable functions.
> Callback functions will be called when item->counter is larger than zero.

This logic sounds complicated to me. I'm afraid that FDW developers may a bit easily misunderstand the logic and make
thebug in their FDW.
 

Isn't it simpler to just disable the timeout in core whenever the transaction ends whether committed or aborted, like
statement_timeoutis disabled after each command? For example, call something like DisableForeignCheckTimeout() in
CommitTransaction()etc.
 

> You are right. I think this suggests that error-reporting should be done in the core layer.
> For meaningful error reporting, I changed a callback specification
> that it should return ForeignServer* which points to downed remote server.

This approach seems to assume that FDW must manage all the ForeignServer information so that the callback can return
it.Is this assumption valid for all the FDW?
 

How about making FDW trigger a query cancel interrupt by signaling SIGINT to the backend, instead?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: [Proposal] Add foreign-server health checks infrastructure

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Fujii-san,

Thank you for good suggestions.

> This logic sounds complicated to me. I'm afraid that FDW developers may a bit
> easily misunderstand the logic and make the bug in their FDW.
> Isn't it simpler to just disable the timeout in core whenever the transaction ends
> whether committed or aborted, like statement_timeout is disabled after each
> command? For example, call something like DisableForeignCheckTimeout() in
> CommitTransaction() etc.

Your idea is that stop the timer at the end of each transactions, right?
I had not adopted that because I thought some developers might want not to stop the timer
even if transactions ends. It caused complexed situation and not have good usecase, however,
so your logic was implemented.

> > You are right. I think this suggests that error-reporting should be done in the
> core layer.
> > For meaningful error reporting, I changed a callback specification
> > that it should return ForeignServer* which points to downed remote server.
>
> This approach seems to assume that FDW must manage all the ForeignServer
> information so that the callback can return it. Is this assumption valid for all the
> FDW?

Not sure, the assumption might be too optimistic.
mysql_fdw can easily return ForeignServer* because it caches serverid,
but dblink and maybe oracle_fdw cannot.

> How about making FDW trigger a query cancel interrupt by signaling SIGINT to
> the backend, instead?

I understood that the error should be caught by QueryCancelPending.
Could you check 0003? Does it follow that?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

RE: [Proposal] Add foreign-server health checks infrastructure

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Hackers,

I found patches we depend have been committed, so rebased.
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=50e570a59e7f86bb41f029a66b781fc79b8d50f1

In this version there is a little bit change in part of postgres_fdw.
A system checking by WaitEventSetCanReportClosed() is added
because some OSes cannot wait WL_SOCKET_CLOSED event.
Note that test cannot be added in the regression test
because cfbot may be not happy.
In my environment a test that contained in previous patches works well.

0001, 0002 is not changed from previous version.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
Kyotaro Horiguchi
Date:
At Tue, 1 Feb 2022 23:51:54 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in 
> This logic sounds complicated to me. I'm afraid that FDW developers
> may a bit easily misunderstand the logic and make the bug in their
> FDW.

> Isn't it simpler to just disable the timeout in core whenever the
> transaction ends whether committed or aborted, like statement_timeout
> is disabled after each command? For example, call something like
> DisableForeignCheckTimeout() in CommitTransaction() etc.

> This approach seems to assume that FDW must manage all the
> ForeignServer information so that the callback can return it. Is this
> assumption valid for all the FDW?

FWIW, I'm not sure this feature necessarily requires core support
dedicated to FDWs.  The core have USER_TIMEOUT feature already and
FDWs are not necessarily connection based.  It seems better if FDWs
can implement health check feature without core support and it seems
possible.  Or at least the core feature should be more generic and
simpler. Why don't we just expose InTransactionHealthCheckCallbacks or
something and operating functions on it?


> How about making FDW trigger a query cancel interrupt by signaling
> SIGINT to the backend, instead?

Mmm. AFAICS the running command will stop with "canceling statement
due to user request", which is a hoax.  We need a more decent message
there.

I understand that the motive of this patch is "to avoid wasted long
local work when fdw-connection dies".  In regard to the workload in
your first mail, it is easily avoided by ending the transaction as soon
as remote access ends. This feature doesn't work for the case "begin;
<long local query>; <fdw access>". But the same measure also works in
that case.  So the only case where this feature is useful is "begin;
<fdw-access>; <some long work>; <fdw-access>; end;".  But in the first
place how frequently do you expecting remote-connection close happens?
If that happens so frequently, you might need to recheck the system
health before implementing this feature.  Since it is correctly
detected when something really went wrong, I feel that it is a bit too
complex for the usefulness especially for the core part.


In conclusion, as my humble opinion I would like to propose to reduce
this feature to:

- Just periodically check health (in any aspect) of all live
  connections regardless of the session state.

- If an existing connection is found to be dead, just try canceling
  the query (or sending query cancel).

One issue with it is how to show the decent message for the query
cancel, but maybe we can have a global variable that suggests the
reason for the cancel.


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



RE: [Proposal] Add foreign-server health checks infrastructure

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Horiguchi-san,

Thank you for giving your suggestions. I want to confirm your saying.

> FWIW, I'm not sure this feature necessarily requires core support
> dedicated to FDWs.  The core have USER_TIMEOUT feature already and
> FDWs are not necessarily connection based.  It seems better if FDWs
> can implement health check feature without core support and it seems
> possible.  Or at least the core feature should be more generic and
> simpler. Why don't we just expose InTransactionHealthCheckCallbacks or
> something and operating functions on it?

I understood that core is too complicated and FDW side is too stupid, right?

> Mmm. AFAICS the running command will stop with "canceling statement
> due to user request", which is a hoax.  We need a more decent message
> there.

+1 about better messages.

> I understand that the motive of this patch is "to avoid wasted long
> local work when fdw-connection dies".

Yeah your understanding is right.

> In regard to the workload in
> your first mail, it is easily avoided by ending the transaction as soon
> as remote access ends. This feature doesn't work for the case "begin;
> <long local query>; <fdw access>". But the same measure also works in
> that case.  So the only case where this feature is useful is "begin;
> <fdw-access>; <some long work>; <fdw-access>; end;".  But in the first
> place how frequently do you expecting remote-connection close happens?
> If that happens so frequently, you might need to recheck the system
> health before implementing this feature.  Since it is correctly
> detected when something really went wrong, I feel that it is a bit too
> complex for the usefulness especially for the core part.

Thanks for analyzing motivation.
Indeed, some cases may be resolved by separating tx and this event rarely happens.

> In conclusion, as my humble opinion I would like to propose to reduce
> this feature to:
>
> - Just periodically check health (in any aspect) of all live
>   connections regardless of the session state.

I understood here as removing following mechanism from core:

* disable timeout at end of tx.
* skip if held off or read commands

> - If an existing connection is found to be dead, just try canceling
>   the query (or sending query cancel).
> One issue with it is how to show the decent message for the query
> cancel, but maybe we can have a global variable that suggests the
> reason for the cancel.

Currently I have no good idea for that but I'll try.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED




RE: [Proposal] Add foreign-server health checks infrastructure

From
"kuroda.hayato@fujitsu.com"
Date:
> I understood here as removing following mechanism from core:
>
> * disable timeout at end of tx.

While reading again and this part might be wrong.
Sorry for inconvenience.
But anyway some codes should be (re)moved from core, right?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: [Proposal] Add foreign-server health checks infrastructure

From
Kyotaro Horiguchi
Date:
Hi, Kuroda-san.

At Thu, 17 Feb 2022 04:11:09 +0000, "kuroda.hayato@fujitsu.com" <kuroda.hayato@fujitsu.com> wrote in 
> Dear Horiguchi-san,
> 
> Thank you for giving your suggestions. I want to confirm your saying.
> 
> > FWIW, I'm not sure this feature necessarily requires core support
> > dedicated to FDWs.  The core have USER_TIMEOUT feature already and
> > FDWs are not necessarily connection based.  It seems better if FDWs
> > can implement health check feature without core support and it seems
> > possible.  Or at least the core feature should be more generic and
> > simpler. Why don't we just expose InTransactionHealthCheckCallbacks or
> > something and operating functions on it?
> 
> I understood that core is too complicated and FDW side is too stupid, right?

I don't think the FDW side is stupid but seem too complex for the
benefit.  And just think that maybe we don't need the core part.

> > Mmm. AFAICS the running command will stop with "canceling statement
> > due to user request", which is a hoax.  We need a more decent message
> > there.
> 
> +1 about better messages.
> 
> > I understand that the motive of this patch is "to avoid wasted long
> > local work when fdw-connection dies".
> 
> Yeah your understanding is right.
> 
> > In regard to the workload in
> > your first mail, it is easily avoided by ending the transaction as soon
> > as remote access ends. This feature doesn't work for the case "begin;
> > <long local query>; <fdw access>". But the same measure also works in
> > that case.  So the only case where this feature is useful is "begin;
> > <fdw-access>; <some long work>; <fdw-access>; end;".  But in the first
> > place how frequently do you expecting remote-connection close happens?
> > If that happens so frequently, you might need to recheck the system
> > health before implementing this feature.  Since it is correctly
> > detected when something really went wrong, I feel that it is a bit too
> > complex for the usefulness especially for the core part.
> 
> Thanks for analyzing motivation.
> Indeed, some cases may be resolved by separating tx and this event rarely happens.
> 
> > In conclusion, as my humble opinion I would like to propose to reduce
> > this feature to:
> > 
> > - Just periodically check health (in any aspect) of all live
> >   connections regardless of the session state.
> 
> I understood here as removing following mechanism from core:
> 
> * disable timeout at end of tx.
> * skip if held off or read commands

I think we're on the same page. Anyway query cancel interrupt is
ignored while rading input.

> > - If an existing connection is found to be dead, just try canceling
> >   the query (or sending query cancel).
> > One issue with it is how to show the decent message for the query
> > cancel, but maybe we can have a global variable that suggests the
> > reason for the cancel.
> 
> Currently I have no good idea for that but I'll try.

However, I would like to hear others' opnions about the direction, of
course.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: [Proposal] Add foreign-server health checks infrastructure

From
Kyotaro Horiguchi
Date:
At Thu, 17 Feb 2022 04:32:26 +0000, "kuroda.hayato@fujitsu.com" <kuroda.hayato@fujitsu.com> wrote in 
> > I understood here as removing following mechanism from core:
> > 
> > * disable timeout at end of tx.
> 
> While reading again and this part might be wrong.
> Sorry for inconvenience.
> But anyway some codes should be (re)moved from core, right?

I think we just don't need to add the special timeout kind to the
core.  postgres_fdw can use USER_TIMEOUT and it would be suffiction to
keep running health checking regardless of transaction state then fire
query cancel if disconnection happens. As I said in the previous main,
possible extra query cancel woud be safe.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



RE: [Proposal] Add foreign-server health checks infrastructure

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Horiguchi-san,

> I think we just don't need to add the special timeout kind to the
> core.  postgres_fdw can use USER_TIMEOUT and it would be suffiction to
> keep running health checking regardless of transaction state then fire
> query cancel if disconnection happens. As I said in the previous main,
> possible extra query cancel woud be safe.

I finally figured out that you mentioned about user-defined timeout system.
Firstly - before posting to hackers - I designed like that,
but I was afraid of an overhead that many FDW registers timeout
and call setitimer() many times. Is it too overcautious?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED





Re: [Proposal] Add foreign-server health checks infrastructure

From
Fujii Masao
Date:

On 2022/02/17 19:35, kuroda.hayato@fujitsu.com wrote:
> Dear Horiguchi-san,
> 
>> I think we just don't need to add the special timeout kind to the
>> core.  postgres_fdw can use USER_TIMEOUT and it would be suffiction to
>> keep running health checking regardless of transaction state then fire
>> query cancel if disconnection happens. As I said in the previous main,
>> possible extra query cancel woud be safe.

Sounds reasonable to me.


> I finally figured out that you mentioned about user-defined timeout system.
> Firstly - before posting to hackers - I designed like that,
> but I was afraid of an overhead that many FDW registers timeout
> and call setitimer() many times. Is it too overcautious?

Isn't it a very special case where many FDWs use their own user timeouts? Could you tell me the assumption that you're
thinking,especially how many FDWs are working?
 

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: [Proposal] Add foreign-server health checks infrastructure

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Fujii-san,

> Isn't it a very special case where many FDWs use their own user timeouts? Could
> you tell me the assumption that you're thinking, especially how many FDWs are
> working?

I came up with the case like star schema, which postgres database connects data store.
If each dbms are different and FDWs have each timeout, many timeout will be registered.
But it may be a corner case and should not be confused with OLTP case. 
So I'll post new patch based on his post.


Best Regards,
Hayato Kuroda
FUJITSU LIMITED


RE: [Proposal] Add foreign-server health checks infrastructure

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Horiguchi-san, Fujii-san,

> > I understood here as removing following mechanism from core:
> >
> > * disable timeout at end of tx.
> > * skip if held off or read commands
>
> I think we're on the same page. Anyway query cancel interrupt is
> ignored while rading input.
>
> > > - If an existing connection is found to be dead, just try canceling
> > >   the query (or sending query cancel).
> > > One issue with it is how to show the decent message for the query
> > > cancel, but maybe we can have a global variable that suggests the
> > > reason for the cancel.
> >
> > Currently I have no good idea for that but I'll try.
>
> However, I would like to hear others' opnions about the direction, of
> course.
>

Based on the idea, I re-implemented the feature. Almost all feature is
moved to postgres_fdw. The abstract of my patch is as follows:

# core

* Exposes QueryCancelMessage for error reporting
* Uses above if it was not NULL

# postgres_fdw

* Defines new GUC postgres_fdw.health_check_interval.
  It is renamed simpler.
* Registers a timeout when initializing connection hash.
* Raises SIGINT and sets QueryCancelMessage to message.
  if detects a connection lost.

I also attached a test as zipped file for keep cfbot quiet.
When connection lost is detected, the following message will show:

```
ERROR:  Foreign Server (servername) might be down.
```

How do you think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
Fujii Masao
Date:

On 2022/02/22 11:53, kuroda.hayato@fujitsu.com wrote:
> How do you think?

Thanks for updating the patches! I will read them.

cfbot is reporting that the 0002 patch fails to be applied cleanly. Could you update the patch?
http://cfbot.cputube.org/patch_37_3388.log

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: [Proposal] Add foreign-server health checks infrastructure

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Fujii-san,

> cfbot is reporting that the 0002 patch fails to be applied cleanly. Could you update
> the patch?
> http://cfbot.cputube.org/patch_37_3388.log

Thanks for reporting and sorry for inconvenience.
I repo was not latest version. Attached can be applied to 52e4f0c

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

RE: [Proposal] Add foreign-server health checks infrastructure

From
"kuroda.hayato@fujitsu.com"
Date:
Cfbot is still angry because of missing PGDLLIMPORT, so attached.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
Fujii Masao
Date:

On 2022/02/22 15:41, kuroda.hayato@fujitsu.com wrote:
> Cfbot is still angry because of missing PGDLLIMPORT, so attached.

Thanks for updating the patches!

The connection check timer is re-scheduled repeatedly even while the backend is in idle state or is running a local
transactionthat doesn't access to any foreign servers. I'm not sure if it's really worth checking the connections even
inthose states. Even without the periodic connection checks, if the connections are closed in those states, subsequent
GetConnection()will detect that closed connection and re-establish the connection when starting remote transaction.
Thought?


When a closed connection is detected in idle-in-transaction state and SIGINT is raised, nothing happens because there
isno query running to be canceled by SIGINT. Also in this case the connection check timer gets disabled. So we can
stillexecute queries that don't access to foreign servers, in the same transaction, and then the transaction commit
fails.Is this expected behavior?
 


When I shutdowned the foreign server while the local backend is in idle-in-transaction state, the connection check
timerwas triggered and detected the closed connection. Then when I executed COMMIT command, I got the following WARNING
message.Is this a bug?
 

     WARNING:  leaked hash_seq_search scan for hash table 0x7fd2ca878f20

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: [Proposal] Add foreign-server health checks infrastructure

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Fujii-san,

Thank you for your quick reviewing! I attached new version.
I found previous patches have wrong name. Sorry.

> The connection check timer is re-scheduled repeatedly even while the backend is
> in idle state or is running a local transaction that doesn't access to any foreign
> servers. I'm not sure if it's really worth checking the connections even in those
> states. Even without the periodic connection checks, if the connections are closed
> in those states, subsequent GetConnection() will detect that closed connection
> and re-establish the connection when starting remote transaction. Thought?

Indeed. We can now control the timer in fdw layer, so disable_timeout() was added
at the bottom of pgfdw_xact_callback(). 

> When a closed connection is detected in idle-in-transaction state and SIGINT is
> raised, nothing happens because there is no query running to be canceled by
> SIGINT. Also in this case the connection check timer gets disabled. So we can still
> execute queries that don't access to foreign servers, in the same transaction, and
> then the transaction commit fails. Is this expected behavior?

It's not happy, but I'm not sure about a good solution. I made a timer reschedule
if connection lost had detected. But if queries in the transaction are quite short,
catching SIGINT may be fail.

> When I shutdowned the foreign server while the local backend is in
> idle-in-transaction state, the connection check timer was triggered and detected
> the closed connection. Then when I executed COMMIT command, I got the
> following WARNING message. Is this a bug?
> 
>      WARNING:  leaked hash_seq_search scan for hash table 0x7fd2ca878f20

Fixed. It is caused because hash_seq_term() was not called when checker detects
a connection lost.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

RE: [Proposal] Add foreign-server health checks infrastructure

From
"kuroda.hayato@fujitsu.com"
Date:
Hi Hackers,

> It's not happy, but I'm not sure about a good solution. I made a timer reschedule
> if connection lost had detected. But if queries in the transaction are quite short,
> catching SIGINT may be fail.

Attached uses another way: sets pending flags again if DoingCommandRead is true.
If a remote server goes down while it is in idle_in_transaction,
next query will fail because of ereport(ERROR).

How do you think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
Fujii Masao
Date:

On 2022/03/04 15:17, kuroda.hayato@fujitsu.com wrote:
> Hi Hackers,
> 
>> It's not happy, but I'm not sure about a good solution. I made a timer reschedule
>> if connection lost had detected. But if queries in the transaction are quite short,
>> catching SIGINT may be fail.
> 
> Attached uses another way: sets pending flags again if DoingCommandRead is true.
> If a remote server goes down while it is in idle_in_transaction,
> next query will fail because of ereport(ERROR).
> 
> How do you think?

Sounds ok to me.

Thanks for updating the patches!

These failed to be applied to the master branch cleanly. Could you update them?

+      this option relies on kernel events exposed by Linux, macOS,

s/this/This

+        GUC_check_errdetail("pgfdw_health_check_interval must be set to 0 on this platform");

The actual parameter name "postgres_fdw.health_check_interval"
should be used for the message instead of internal variable name.

+        /* Register a timeout for checking remote servers */
+        pgfdw_health_check_timeout = RegisterTimeout(USER_TIMEOUT, pgfdw_connection_check);

This registered signal handler does lots of things. But that's not acceptable
and they should be performed outside signal handler. No?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: [Proposal] Add foreign-server health checks infrastructure

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Fujii-san,

Thanks for checking!

> These failed to be applied to the master branch cleanly. Could you update them?

PSA rebased patches. I reviewed my myself and they contain changes.
E.g., move GUC-related code to option.c.


> +      this option relies on kernel events exposed by Linux, macOS,
> 
> s/this/This

Fixed.

> 
> +        GUC_check_errdetail("pgfdw_health_check_interval must be set
> to 0 on this platform");
> 
> The actual parameter name "postgres_fdw.health_check_interval"
> should be used for the message instead of internal variable name.

Fixed.

> This registered signal handler does lots of things. But that's not acceptable
> and they should be performed outside signal handler. No?


I modified like v09 or earlier versions, which has a mechanism for registering CheckingRemoteServersCallback.
It had been removed because we want to keep core simpler, but IIUC it is needed
if the signal handler just sets some flags.
The core-side does not consider the current status of transaction and running query for simpleness.


Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
Andres Freund
Date:
Hi,

On 2022-09-21 11:56:56 +0000, kuroda.hayato@fujitsu.com wrote:
> PSA rebased patches. I reviewed my myself and they contain changes.
> E.g., move GUC-related code to option.c.

This seems to reliably fail on windows. See
https://cirrus-ci.com/task/6454408568373248
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/39/3388


https://api.cirrus-ci.com/v1/artifact/task/6454408568373248/testrun/build/testrun/postgres_fdw/regress/regression.diffs

diff -w -U3 C:/cirrus/contrib/postgres_fdw/expected/postgres_fdw.out
C:/cirrus/build/testrun/postgres_fdw/regress/results/postgres_fdw.out
--- C:/cirrus/contrib/postgres_fdw/expected/postgres_fdw.out    2022-10-02 14:47:24.486355800 +0000
+++ C:/cirrus/build/testrun/postgres_fdw/regress/results/postgres_fdw.out    2022-10-02 15:02:03.039752800 +0000
@@ -11478,6 +11478,8 @@
 ALTER SERVER loopback OPTIONS (SET application_name 'healthcheck');
 -- Set GUC for checking the health of remote servers
 SET postgres_fdw.health_check_interval TO '1s';
+ERROR:  invalid value for parameter "postgres_fdw.health_check_interval": 1000
+DETAIL:  postgres_fdw.health_check_interval must be set to 0 on this platform
 BEGIN;
 SELECT 1 FROM ft1 LIMIT 1;
  ?column?
@@ -11495,7 +11497,15 @@

 -- While sleeping the process down will be detected.
 SELECT pg_sleep(3);
-ERROR:  Foreign Server loopback might be down.
+ pg_sleep
+----------
+
+(1 row)
+
 COMMIT;
+ERROR:  server closed the connection unexpectedly
+    This probably means the server terminated abnormally
+    before or while processing the request.
+CONTEXT:  remote SQL command: COMMIT TRANSACTION
 -- Clean up
 RESET debug_discard_caches;


Greetings,

Andres Freund



RE: [Proposal] Add foreign-server health checks infrastructure

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Andres,

> This seems to reliably fail on windows. See

Thanks for reporting. Actually this feature cannot be used on Windows machine.
To check the status of each socket that connects to the foreign server,
the socket event WL_SOCKET_CLOSED is used.
The event is only enabled on some OSes, and Windows machine cannot.

The part must be skipped if the system cannot be used the event, but I was not sure how to do that...

Best Regards,
Hayato Kuroda
FUJITSU LIMITED




Re: [Proposal] Add foreign-server health checks infrastructure

From
Önder Kalacı
Date:
Hi Hayato Kuroda,

Thanks for the patch. I think there are some non-fdw extensions out there which could benefit from this logic. That's why I want to first learn some more about high-level design/goals of the patch a little more.

+/*
+ * Call callbacks for checking remote servers.
+ */
+void
+CallCheckingRemoteServersCallbacks(void)

Why do we run this periodically, but not when a specific connection is going to be used? Wouldn't running it periodically prevent detecting some already-closed sockets at the time of the connection used (e.g., we checked 10 seconds ago, the server died 5 seconds ago)?

In other words, what is the trade-off for calling pgfdw_connection_check_internal() inside GetConnection() when we are about to use a "cached" connection? I think that might simplify the patch as well.

+/*
+ * Helper function for pgfdw_connection_check
+ */
+static bool
+pgfdw_connection_check_internal(PGconn *conn)
+{

Can we have this function/logic on Postgres core, so that other extensions can also use?

 + AddWaitEventToSet(eventset, WL_SOCKET_CLOSED, PQsocket(conn), NULL, NULL);

What if PQsocket(conn) returns -1? Maybe we move certain connection state checks into pgfdw_connection_check_internal() such that it is more generic? I can think of checks like: conn!=NULL, PQsocket(conn) != PGINVALID_SOCKET, PQstatus == CONNECTION_OK


+ eventset = CreateWaitEventSet(CurrentMemoryContext, 1);
+ AddWaitEventToSet(eventset, WL_SOCKET_CLOSED, PQsocket(conn), NULL, NULL);
+
+ WaitEventSetWait(eventset, 0, &events, 1, 0);
+
+ if (events.events & WL_SOCKET_CLOSED)
+ {
+ FreeWaitEventSet(eventset);
+ return false;
+ }
+ FreeWaitEventSet(eventset);

Do you see any performance implications of creating/freeing waitEventSets per check? I wonder if we can somehow re-use the same waitEventSet by modifyWaitEvent? I guess no, but still, if this check causes a performance implication, can we somehow cache 1 waitEventSet per connection?


Thanks,
Onder KALACI
Developing the Citus extension @Microsoft

RE: [Proposal] Add foreign-server health checks infrastructure

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Önder,

Thank you for being interest to my patch! Your suggestions will be included to newer version.

> In other words, what is the trade-off for calling
> pgfdw_connection_check_internal() inside GetConnection() when we are about
> to use a "cached" connection? I think that might simplify the patch as well

If the checking function is called not periodically but GetConnection(),
it means that the health of foreign servers will be check only when remote connections are used.
So following workload described in [1] cannot handle the issue.

BEGIN --- remote operations--- local operations --- COMMIT

But, yes, I perfectly agreed that it could simplify the code
because we can reduce the timer part. This is second plan of this patch,
I may move on this approach if it is still useful.

> Can we have this function/logic on Postgres core, so that other extensions
> can also use?

I was not sure about any use-case, but I think it can because it does quite general things.
Is there any good motivation to do that?

> What if PQsocket(conn) returns -1? Maybe we move certain connection state
> checks into pgfdw_connection_check_internal() such that it is more generic?
> I can think of checks like: conn!=NULL, PQsocket(conn) != PGINVALID_SOCKET,
> PQstatus == CONNECTION_OK

ereport(ERROR) will be thrown if PQsocket(conn) returns -1.
All of you said should be handled here. I will modify it.

> Do you see any performance implications of creating/freeing waitEventSets
> per check? I wonder if we can somehow re-use the same waitEventSet by
> modifyWaitEvent? I guess no, but still, if this check causes a performance
> implication, can we somehow cache 1 waitEventSet per connection?

I have not tested yet, but I agreed this will be caused performance decrease.
In next version first I will re-use the event set anyway, and it must be considered later.
Actually I'm not sure your suggestion,
but you mean to say that we can add a hash table that associates  PGconn and WaitEventSet,  right?

[1]:
https://www.postgresql.org/message-id/TYAPR01MB58662809E678253B90E82CE5F5889%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Re: [Proposal] Add foreign-server health checks infrastructure

From
Önder Kalacı
Date:
Hi Hayato Kuroda,

 
If the checking function is called not periodically but GetConnection(),
it means that the health of foreign servers will be check only when remote connections are used.
So following workload described in [1] cannot handle the issue.

BEGIN --- remote operations--- local operations --- COMMIT


As far as I can see this patch is mostly useful for detecting the failures on the initial remote command. This is especially common when the remote server does a failover/switchover and postgres_fdw uses a cached connection to access to the remote server.

The remote server failures within a transaction block sounds like a much less common problem. Still, you can give a good error message before COMMIT is sent to the remote server.
 
I agree that this doesn't solve the issue you described, but I'm not sure if it is worthwhile to fix such a problem.


> Can we have this function/logic on Postgres core, so that other extensions
> can also use?

I was not sure about any use-case, but I think it can because it does quite general things.
Is there any good motivation to do that?

I think any extension that deals with multiple Postgres nodes can benefit from such logic. In fact, the reason I realized this patch is that on the Citus extension, we are trying to solve a similar problem [1], [2].

Thinking even more, I think any extension that uses libpq and WaitEventSets can benefit from such a function. 


> Do you see any performance implications of creating/freeing waitEventSets
> per check? I wonder if we can somehow re-use the same waitEventSet by
> modifyWaitEvent? I guess no, but still, if this check causes a performance
> implication, can we somehow cache 1 waitEventSet per connection?

I have not tested yet, but I agreed this will be caused performance decrease.
In next version first I will re-use the event set anyway, and it must be considered later.
Actually I'm not sure your suggestion,
but you mean to say that we can add a hash table that associates  PGconn and WaitEventSet,  right?


I think it also depends on where you decide to put pgfdw_connection_check_internal(). If you prefer the postgres_fdw side, could we maybe use ConnCacheEntry in contrib/postgres_fdw/connection.c? 

But if you decide to put it into the Postgres side, the API for pgfdw_connection_check_internal() -- or equivalent function -- could be discussed. Do we pass a WaitEventSet and if it is NULL create a new one, else use what is passed to the function? Not sure, maybe you can come up with a better API.
 
Thanks,
Onder KALACI

RE: [Proposal] Add foreign-server health checks infrastructure

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Önder,

> As far as I can see this patch is mostly useful for detecting the failures
> on the initial remote command. This is especially common when the remote
> server does a failover/switchover and postgres_fdw uses a cached connection
> to access to the remote server.

Sounds reasonable. Do you mean that we can add additional GUC like "postgres_fdw.initial_check",
wait WL_SOCKET_CLOSED if the conneciton is found in the hash table, and do reconnection if it might be closed, right?

> I think any extension that deals with multiple Postgres nodes can benefit
> from such logic. In fact, the reason I realized this patch is that on the
> Citus extension, we are trying to solve a similar problem [1], [2].

> Thinking even more, I think any extension that uses libpq and WaitEventSets
> can benefit from such a function.

OK, I agreed it may be useful, but where should this function be?
This cannot be used from application, so it should not in interface/libpq dir. So backend/libpq/pqcomm.c?

> I think it also depends on where you decide to put
> pgfdw_connection_check_internal(). If you prefer the postgres_fdw side,
> could we maybe use ConnCacheEntry in contrib/postgres_fdw/connection.c?
> But if you decide to put it into the Postgres side, the API
> for pgfdw_connection_check_internal() -- or equivalent function -- could be
> discussed. Do we pass a WaitEventSet and if it is NULL create a new one,
> else use what is passed to the function? Not sure, maybe you can come up
> with a better API.

Thank you for describing more detail. I can imagine you said.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


RE: [Proposal] Add foreign-server health checks infrastructure

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Önder, Fujii-san,

Thanks for giving many comments! Based on off and on discussions, I modified my patch.
Note that this patch-set did not contain all of fixes you said.

[Modified]

1.
QueryCancelPending was cleaned up when the query cancel was skipped.
In the previous version the QueryCancelPending was set to true again even if the ereport(ERROR) was skipped.
It was because I thought the transaction that accessed to foreign servers but one of them crashed should be aborted
immediately.
But currenlty postgres does not have such "delayed" query cancel mechanism, so I followed the manner.

Another approach is documenting like "If QueryCancelPending is already filled, the callback function must send SIGINT
signal".
This seems simpler for us, but it may be troublesome for FDW authors. How do you think?

2.
The new filed servername was added to the hash entry. It was needed
because the server name must be logged in the pgfdw_connection_check(), but according to [1],
the catalog should not be read in ProcessInterrupts() to avoid the race condition.
The string is pstrdup()'d when the entry has been created, and replaced when ALTER SERER RENAME has been executed.

3.
The manner to implement the health check function was documented.

4.
raise() was replaced to kill(), because it was not used in core code. 

5.
APIs to handle QueryCancelMessage were added to avoid overriding the message.

6.
Some checks for socket was added in the checking function.

[1]: https://www.postgresql.org/message-id/CA%2BTgmoYW_rSOW4JMQ9_0Df9PKQ%3DsQDOKUGA4Gc9D8w4wui8fSA%40mail.gmail.com



[Unmodified]

a.
The checking function is still in the postgres_fdw.
This is because I could not find any good file to move this function.
IIUC this function needs PGconn as an argument,
but it seems that it is declared in libpq-fe.h and it should not be included from core. 

b.
In the checking function, the eventset is still created, added, and freed.
Firstly I tried to declare eventset as inner-function static variable and reuse that, but I could not do. This is
because:

* The size of eventset cannot be determined when the function is called for the first time.
  The size must be set in CreateWaitEventSet(), and it must be more than the number of remote connection.
* To reuse the eventset the added event must be removed every time, but it cannot.
  The API ModifyWaitEvent() can be only used for "modifying" the event, not "removing".
  If the function is called as events=0, we will get assertion failure in WaitEventAdjustEpoll() or something.

To avoid creating/freeing the event set, hash table must be modified but it has not done yet.

c.
Currently the status of socket is not checked in the GetConnection().
I agreed it may be useful, but I thought it might be out of scope. It could not meet the requirements.
I think it can be added separately.


Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
Önder Kalacı
Date:
Hi,

Sounds reasonable. Do you mean that we can add additional GUC like "postgres_fdw.initial_check",
wait WL_SOCKET_CLOSED if the conneciton is found in the hash table, and do reconnection if it might be closed, right?


Alright, it took me sometime to realize that postgres_fdw already has a retry mechanism if the first command fails: postgres_fdw: reestablish new connection if cached one is detected as… · postgres/postgres@32a9c0b (github.com) 

Still, the reestablish mechanism can be further simplified with WL_SOCKET_CLOSED event such as the following (where we should probably rename pgfdw_connection_check_internal):

 /*
* If the connection needs to be remade due to invalidation, disconnect as
* soon as we're out of all transactions.
*/
 | +bool remoteSocketIsClosed =  entry->conn != NULL : pgfdw_connection_check_internal(entry->conn) : false;
if (entry->conn != NULL && (entry->invalidated || remoteSocketIsClosed) && entry->xact_depth == 0)
{
elog(DEBUG3, "closing connection %p for option changes to take effect",
entry->conn);
disconnect_pg_server(entry);
}
 | +else if (remoteSocketIsClosed && && entry->xact_depth > 0)
 | + error ("Remote Server is down ...")


In other words, a variation of pgfdw_connection_check_internal() could potentially go into interfaces/libpq/libpq-fe.h (backend/libpq/pqcomm.c or src/interfaces/libpq/fe-connect.c). Then, GetConnection() in postgres_fdw, it can force to reconnect as it is already done for some cases or error properly:
 

 Based on off and on discussions, I modified my patch.


I still think that it is probably too much work/code to detect the mentioned use-case you described on [1]. Each backend constantly calling CallCheckingRemoteServersCallbacks() for this purpose doesn't sound the optimal way to approach the "check whether server down" problem. You typically try to decide whether a server is down by establishing a connection (or ping etc), not going over all the existing connections. 

As far as I can think of, it should probably be a single background task checking whether the server is down. If so, sending an invalidation message to all the backends such that related backends could act on the invalidation and throw an error. This is to cover the use-case you described on [1].

Also, maybe we could have a new catalog table like pg_foreign_server_health or such, where we can keep the last time the check succeeded (and/or failed), and how many times the check  succeeded (and/or failed).

This is of course how I would approach this problem. I think some other perspectives on this would be very useful to hear.

Thanks,
Onder KALACI



 

RE: [Proposal] Add foreign-server health checks infrastructure

From
"osumi.takamichi@fujitsu.com"
Date:
Hi,


On Wednesday, October 5, 2022 6:27 PM kuroda.hayato@fujitsu.com <kuroda.hayato@fujitsu.com> wrote:
> Thanks for giving many comments! Based on off and on discussions, I modified
> my patch.
Thank you for your patch set !

While reviewing and testing the new v16, I've met a possible issue
by slightly adjusting the scenario written in [1].

I mainly followed the steps there and
replaced the command "SELECT" for the remote table at 6-9 with "INSERT" command.
Then, after waiting for few seconds, the "COMMIT" succeeded like below output,
even after the server stop of the worker side.

After the transaction itself, any reference to the remote table fails.
Note that the local table has some data initially in my test.

postgres=# begin;
BEGIN
postgres=*# insert into remote values (-1000);
INSERT 0 1
postgres=*# select * from local;
 number 
--------
    101
    102
    103
    104
    105
(5 rows)

postgres=*# commit;
COMMIT
postgres=# select * from remote;
ERROR:  could not connect to server "my_external_server"
DETAIL:  connection to server on socket "/tmp/.s.PGSQL.9999" failed: No such file or directory
        Is the server running locally and accepting connections on that socket?


Additionally, the last reference "SELECT" which failed above can succeed,
if I restart the worker server before the "SELECT" command to the remote table.
This means the transaction looks successful but the data isn't there ?
Could you please have a look at this issue ?


[1] -
https://www.postgresql.org/message-id/TYAPR01MB58662CD4FD98AA475B3D10F9F59B9%40TYAPR01MB5866.jpnprd01.prod.outlook.com


Best Regards,
    Takamichi Osumi


RE: [Proposal] Add foreign-server health checks infrastructure

From
"osumi.takamichi@fujitsu.com"
Date:
On Wednesday, October 5, 2022 6:27 PM kuroda.hayato@fujitsu.com <kuroda.hayato@fujitsu.com> wrote:
> Thanks for giving many comments! Based on off and on discussions, I modified
> my patch.
Here are my other quick review comments on v16.


(1) v16-0001 : definition of a new structure

CheckingRemoteServersCallbackItem can be added to typedefs.list.


(2) v16-0001 : UnRegisterCheckingRemoteServersCallback's header comment

+void
+UnRegisterCheckingRemoteServersCallback(CheckingRemoteServersCallback callback,
+                                                                               void *arg)
+{

Looks require a header comment for this function,
because in this file, all other functions have each one.


(3) v16-0002 : better place to declare new variables

New variables 'old' and 'server' in GetConnection() can be moved to
a scope of 'if' statement where those are used.

@@ -138,6 +149,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt, PgFdwConnState **state)
        ConnCacheEntry *entry;
        ConnCacheKey key;
        MemoryContext ccxt = CurrentMemoryContext;
+       MemoryContext old;
+       ForeignServer *server = GetForeignServer(user->serverid);

(4) v16-0002 : typo in pgfdw_connection_check_internal()


+       /* return false is if the socket seems to be closed. */

It should be "return false if the socket seems to be closed" ?


(5) v16-0002 : pgfdw_connection_check's message

When I tested the new feature, I got a below message.

"ERROR:  Foreign Server my_external_server might be down."

I think we can wrap the server name with double quotes
so that the message is more aligned with existing error message
like connect_pg_server.


(6) v16-003 : typo

+      Authors needs to register the callback function via following API.

Should be "Authors need to register the ...".


(7) v16-0004 : unnecessary file

I think we can remove a new file which looks like a log file.

.../postgres_fdw/expected/postgres_fdw_1.out



Best Regards,
    Takamichi Osumi


RE: [Proposal] Add foreign-server health checks infrastructure

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Önder,

Thanks for giving suggestions!

> Still, the reestablish mechanism can be further simplified with
> WL_SOCKET_CLOSED event such as the following (where we should probably
> rename pgfdw_connection_check_internal):

Sounds reasonable.
I think it may be included in this patch. I will try to next (or later) version.

> In other words, a variation of pgfdw_connection_check_internal()
> could potentially go into interfaces/libpq/libpq-fe.h
> (backend/libpq/pqcomm.c or src/interfaces/libpq/fe-connect.c).

Hmm, IIUC libpq related function and data structures cannot be accessed from core source,
so we cannot move to pqcomm.c.
(This is a motivation for introducing libpqwalreceiver library. It is used to avoid to link libpq directly)
And functions in libpq/fe-connect.c will be included libpq.so,
but latch related functions like WaitEventSetWait() should not be called from client application.
So it is also not appropriate.
In short, there are no good position to place the function because this requires both of libpq and core functions.


> I still think that it is probably too much work/code to detect the
> mentioned use-case you described on [1]. Each backend constantly
> calling CallCheckingRemoteServersCallbacks() for this purpose doesn't sound
> the optimal way to approach the "check whether server down" problem. You
> typically try to decide whether a server is down by establishing a
> connection (or ping etc), not going over all the existing connections.

Yes, the approach that establishes a new connection is very simple, but I think it has some holes.
For example, if the DNS server or some routing software may is stopped,
we will fail to connect to foreign servers.
In your approach, we regard the case as "failed" and try to invalidate the server
even if the existing connection can be used.
Another case is that if a server goes down and failover has occurred, we will succeed to connect to foreign servers.
We may regard the case as a "success", but we cannot COMMIT the transaction.


> As far as I can think of, it should probably be a single background task
> checking whether the server is down. If so, sending an invalidation message
> to all the backends such that related backends could act on the
> invalidation and throw an error. This is to cover the use-case you
> described on [1].

Indeed your approach covers the use case I said, but I'm not sure whether it is really good. 
In your approach, once the background worker process will manage all foreign servers.
It may be OK if there are a few servers, but if there are hundreds of servers,
the time interval during checks will be longer.

Currently, each FDW can decide whether we do health checks or not per the backend.
For example, we can skip health checks if the foreign server is not used now.
The background worker cannot control such a way.

Moreover, methods to connect to foreign servers and check health are different per FDW.
In terms of mysql_fdw [1], we must do mysql_init() and mysql_real_connect().
About file_fdw, we do not have to connect, but developers may want to calculate checksum and compare.
Therefore, we must provide callback functions anyway.

Based on the above, I do not agree that we introduce a new background worker and make it to do a health check.

> This is of course how I would approach this problem. I think some other
> perspectives on this would be very useful to hear.

Yes, we can hear other opinions :-).

[1]: https://github.com/EnterpriseDB/mysql_fdw

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

 

Re: [Proposal] Add foreign-server health checks infrastructure

From
Kyotaro Horiguchi
Date:
At Mon, 17 Oct 2022 07:27:21 +0000, "kuroda.hayato@fujitsu.com" <kuroda.hayato@fujitsu.com> wrote in
> > In other words, a variation of pgfdw_connection_check_internal()
> > could potentially go into interfaces/libpq/libpq-fe.h
> > (backend/libpq/pqcomm.c or src/interfaces/libpq/fe-connect.c).
>
> Hmm, IIUC libpq related function and data structures cannot be accessed from core source,
> so we cannot move to pqcomm.c.
> (This is a motivation for introducing libpqwalreceiver library. It is used to avoid to link libpq directly)
> And functions in libpq/fe-connect.c will be included libpq.so,
> but latch related functions like WaitEventSetWait() should not be called from client application.
> So it is also not appropriate.
> In short, there are no good position to place the function because this requires both of libpq and core functions.

Might be on slight different direction, but it looks to me a bit too
much to use WaitEventSet to check only if a socket is live or not.

A quick search in the tree told me that we could use pqSocketCheck()
instead, and I think it would be the something that "could potentially
go into libpq-fe.h" as Önder mentioned, if I understand what he said
correctly.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



RE: [Proposal] Add foreign-server health checks infrastructure

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Horiguchi-san,

> Might be on slight different direction, but it looks to me a bit too
> much to use WaitEventSet to check only if a socket is live or not.
> 
> A quick search in the tree told me that we could use pqSocketCheck()
> instead, and I think it would be the something that "could potentially
> go into libpq-fe.h" as Önder mentioned, if I understand what he said
> correctly.

Based on your suggestion, I tried to add like following function to fe-misc.c:

```
int
PQconncheck(PGconn *conn)
{
    /* Raise an ERROR if invalid socket has come */
    if (conn == NULL ||
        PQsocket(conn) == PGINVALID_SOCKET)
        return -1;

    return pqSocketCheck(conn, 1, 1, -1);
}
```

... and replace pgfdw_connection_check_internal() to PQconncheck(),
but it did not work well.
To be exact, pqSocketCheck() said the socket was "readable" and "writable"
even if the connection has been killed.

IIUC, pqSocketCheck () calls pqSocketPoll(),
and in the pqSocketPoll() we poll()'d the POLLIN or POLLOUT event.
But according to [1], we must wait POLLRDHUP event,
so we cannot reuse it straightforward.

If we really want to move the checking function anyway,
we must follow AddWaitEventToSet() and some WaitEventAdjust functions.
I'm not sure whether it is really good.

[1]: https://man7.org/linux/man-pages/man2/poll.2.html

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


RE: [Proposal] Add foreign-server health checks infrastructure

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Osumi-san,

> I mainly followed the steps there and
> replaced the command "SELECT" for the remote table at 6-9 with "INSERT"
> command.
> Then, after waiting for few seconds, the "COMMIT" succeeded like below output,
> even after the server stop of the worker side.

> Additionally, the last reference "SELECT" which failed above can succeed,
> if I restart the worker server before the "SELECT" command to the remote table.
> This means the transaction looks successful but the data isn't there ?
> Could you please have a look at this issue ?

Good catch. This was occurred because we disconnected the crashed server.
 
Previously the failed server had been disconnected in the pgfdw_connection_check().
It was OK if the transaction(or statement) was surely cacneled.
But currently the statement might be not canceled because QueryCancelPending might be cleaned up[1].

If we failed to cancel statements and reached pgfdw_xact_callback(),
the connection would not be used because entry->conn is NULL.
That's why you sucseeded to do "COMMIT".
However, the backend did not send "COMMIT" command to the srever,
so inserted data was vanished on remote server.

I understood that we should not call disconnect_pg_server(entry->conn) even if we detect the disconnection.
If should be controlled in Xact callback. This will be modified in next version.

Moreover, I noticed that we should enable timer even if the QueryCancelMessage was not NULL.
If we do not turn on here, the checking will be never enabled again.


[1]:
https://www.postgresql.org/message-id/TYAPR01MB5866CE34430424A588F60FC2F55D9%40TYAPR01MB5866.jpnprd01.prod.outlook.com


Best Regards,
Hayato Kuroda
FUJITSU LIMITED


RE: [Proposal] Add foreign-server health checks infrastructure

From
"kuroda.hayato@fujitsu.com"
Date:
Dear Osumi-san,

Thanks for reviewing! PSA v17 patchset.

> (1) v16-0001 : definition of a new structure
> 
> CheckingRemoteServersCallbackItem can be added to typedefs.list.

Added.

> (2) v16-0001 : UnRegisterCheckingRemoteServersCallback's header comment
> 
> +void
> +UnRegisterCheckingRemoteServersCallback(CheckingRemoteServersCallback
> callback,
> +
> void *arg)
> +{
> 
> Looks require a header comment for this function,
> because in this file, all other functions have each one.

Added. Additionally, I renamed this function to Unregister...,
this follows other unregister functions.


> (3) v16-0002 : better place to declare new variables
> 
> New variables 'old' and 'server' in GetConnection() can be moved to
> a scope of 'if' statement where those are used.
> 
> @@ -138,6 +149,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt,
> PgFdwConnState **state)
>         ConnCacheEntry *entry;
>         ConnCacheKey key;
>         MemoryContext ccxt = CurrentMemoryContext;
> +       MemoryContext old;
> +       ForeignServer *server = GetForeignServer(user->serverid);

Fixed.

> (4) v16-0002 : typo in pgfdw_connection_check_internal()
> 
> 
> +       /* return false is if the socket seems to be closed. */
> 
> It should be "return false if the socket seems to be closed" ?

Fixed.

> (5) v16-0002 : pgfdw_connection_check's message
> 
> When I tested the new feature, I got a below message.
> 
> "ERROR:  Foreign Server my_external_server might be down."
> 
> I think we can wrap the server name with double quotes
> so that the message is more aligned with existing error message
> like connect_pg_server.

Fixed.
...I'm not sure whether this message is still need, because
the disconnection was delegated to XactCallback, and the function did following output:

```
WARNING:  FATAL:  terminating connection due to administrator command
server closed the connection unexpectedly
       This probably means the server terminated abnormally
       before or while processing the request.
```

> (6) v16-003 : typo
> 
> +      Authors needs to register the callback function via following API.
> 
> Should be "Authors need to register the ...".

Fixed.

> (7) v16-0004 : unnecessary file
> 
> I think we can remove a new file which looks like a log file.
> 
> .../postgres_fdw/expected/postgres_fdw_1.out

This is needed to pass the test on the windows platform. See:
https://www.postgresql.org/docs/devel/regress-variant.html

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Attachment

RE: [Proposal] Add foreign-server health checks infrastructure

From
"osumi.takamichi@fujitsu.com"
Date:
On Monday, October 17, 2022 9:25 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote:
> > I mainly followed the steps there and
> > replaced the command "SELECT" for the remote table at 6-9 with "INSERT"
> > command.
> > Then, after waiting for few seconds, the "COMMIT" succeeded like below
> > output, even after the server stop of the worker side.
> I understood that we should not call disconnect_pg_server(entry->conn) even if
> we detect the disconnection.
> If should be controlled in Xact callback. This will be modified in next version.
Hi,

FYI, I quickly rechecked the patch's behavior on that scenario with v17.
Now, the last "COMMIT" above failed expectedly, which looks reasonable behavior.
Thank you for having modified it.


Best Regards,
    Takamichi Osumi


Re: [Proposal] Add foreign-server health checks infrastructure

From
Önder Kalacı
Date:
Hi,

> As far as I can think of, it should probably be a single background task
> checking whether the server is down. If so, sending an invalidation message
> to all the backends such that related backends could act on the
> invalidation and throw an error. This is to cover the use-case you
> described on [1].

Indeed your approach covers the use case I said, but I'm not sure whether it is really good.
In your approach, once the background worker process will manage all foreign servers.
It may be OK if there are a few servers, but if there are hundreds of servers,
the time interval during checks will be longer.

I expect users typically will have a lot more backends than the servers. We can have a threshold for spinning a new bg worker (e.g., every 10 servers gets a new bg worker etc.). Still, I think that'd be an optimization that is probably not necessary for the majority of the users?
 
Currently, each FDW can decide whether we do health checks or not per the backend.
For example, we can skip health checks if the foreign server is not used now.
The background worker cannot control such a way.
Based on the above, I do not agree that we introduce a new background worker and make it to do a health check.

Again, the definition of "health check" is probably different for me. I'd expect the health check to happen continuously, ideally keeping track of how many consecutive times it succeeded and/or last time it failed/succeeded etc.

A transaction failing with a bad error message (or holding some resources locally until the transaction is committed) doesn't sound essential to me. Is there any specific workload are you referring for optimizing to rollback a transaction earlier if a remote server dies?  What kind of workload would benefit from that? Maybe there is, but not clear to me and haven't seen discussed on the thread (sorry if I missed).

I'm trying to understand if we are trying to solve a problem that does not really exists. I'm bringing this up, because I often deal with architectures where there is a local node and remote transaction on different Postgres servers. And, I have not encountered many (or any) pattern that'd benefit from this change much. In fact, I think, on the contrary, this might add significant overhead for OLTP type of high query throughput systems.
 
Moreover, methods to connect to foreign servers and check health are different per FDW.
In terms of mysql_fdw [1], we must do mysql_init() and mysql_real_connect().
About file_fdw, we do not have to connect, but developers may want to calculate checksum and compare.
Therefore, we must provide callback functions anyway.


I think providing callback functions is useful for any case. Each fdw (or in general extension) should be able to provide its own "health check" function.
 
Thanks,
Onder KALACI

RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Önder, all,

Thank you for responding and sorry for late response.

> A transaction failing with a bad error message (or holding some resources
> locally until the transaction is committed) doesn't sound essential to me.
> Is there any specific workload are you referring for optimizing to rollback
> a transaction earlier if a remote server dies?  What kind of workload would
> benefit from that? Maybe there is, but not clear to me and haven't seen
> discussed on the thread (sorry if I missed).

I (and my company) worried about overnight batch processing that
contains some accesses to foreign servers. If the transaction is opened overnight and
one of foreign servers is crashed during it, the transaction must be rollbacked.
But there is a possibility that DBAs do not recognize the crash and
they waste a time until the morning. This problem may affect customer's business.
(It may not be sufficient to check the status from another different server.
DBAs must check the network between the databases, and they may be oversight.)
This is a motivation we thought.

> I'm trying to understand if we are trying to solve a problem that does not
> really exists. I'm bringing this up, because I often deal with
> architectures where there is a local node and remote transaction on
> different Postgres servers. And, I have not encountered many (or any)
> pattern that'd benefit from this change much. In fact, I think, on the
> contrary, this might add significant overhead for OLTP type of high query
> throughput systems.

As I said above, I did not considered about OLTP system. But I agreed that the current
callback mechanism may have significant overhead.

Actually, we may not decide the correct way to detect the failure now.
Your argument, which operations should be done by BGworker and we record stats about checking,
seems to be efficient and may be smarter but this may be not match my motivation now.
My approach may have large overhead and may be not able to use for OLTP system.


So how about implementing a check function as an SQL function once and update incrementally?
This still satisfy our motivation and it can avoid overhead because we can reduce the number of calling it.
If we decide that we establish a new connection in the checking function, we can refactor the it.
And if we decide that we introduce health-check BGworker, then we can add a process that calls implemented function
periodically.

PSA patchset that implemented as an SQL function. I moved the checking function to libpq layer, fe-misc.c.
Note that poll() is used here, it means that currently this function can be used on some limited platforms.

I have added a parameter check_all that controls the scope of to-be-checked servers,
But this is not related with my motivation so we can remove if not needed.

(I have not implemented another version that uses epoll() or kqueue(),
because they seem to be not called on the libpq layer. Do you know any reasons?)

How do you think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
Önder Kalacı
Date:
Hi Hayota Kuroda,


I (and my company) worried about overnight batch processing that
contains some accesses to foreign servers. If the transaction is opened overnight and
one of foreign servers is crashed during it, the transaction must be rollbacked.
But there is a possibility that DBAs do not recognize the crash and
they waste a time until the morning. This problem may affect customer's business.
(It may not be sufficient to check the status from another different server.
DBAs must check the network between the databases, and they may be oversight.)
This is a motivation we thought.

Thanks for the clarification, I agree that this might be a valid concern for systems.
 

So how about implementing a check function as an SQL function once and update incrementally?

I think a SQL function makes sense. 
 
This still satisfy our motivation and it can avoid overhead because we can reduce the number of calling it.

Yes, it makes sense. Also, it allows other extensions to utilize the new libpq function, which is neat.
 
If we decide that we establish a new connection in the checking function, we can refactor the it.
And if we decide that we introduce health-check BGworker, then we can add a process that calls implemented function periodically.

Right, your approach doesn't conflict with a more sophisticated approach, in fact is a useful building block.
 

Note that poll() is used here, it means that currently this function can be used on some limited platforms.


I think more experienced hackers could guide us here. I don't see a problem with that, but checking other functions like pqSocketPoll(), I see that Postgres uses the select system call. I wonder if we can have something similar with select? Seems no, but wanted to bring up in case you know more about that?
 
I have added a parameter check_all that controls the scope of to-be-checked servers,
But this is not related with my motivation so we can remove if not needed.

I guess it kind of makes sense to have the flexibility to check all connections vs only in tx connections.

Though, maybe we should follow a similar approach to postgres_fdw_disconnect(servername) / postgres_fdw_disconnect_all()

    postgres_fdw_verify_connection_states(servername) / postgres_fdw_verify_connection_states_all()

That seems like a more natural API considering the other UDFs. You can also use in combination with postgres_fdw_get_connections()


(I have not implemented another version that uses epoll() or kqueue(),
because they seem to be not called on the libpq layer. Do you know any reasons?)

 
Hmm, I don't know the reason. Is that maybe epoll is available on a smaller number of platforms and libpq can be used on more platforms as being a client library?

Now, some comments regarding the v18:

+static int
+pqConncheck_internal(int sock)
+{
+#if (defined(HAVE_POLL) && defined(POLLRDHUP))
+ struct pollfd input_fd;
+
+ input_fd.fd = sock;
+ input_fd.events = POLLRDHUP;
+ input_fd.revents = 0;
+
+ poll(&input_fd, 1, 0);
+
+ return input_fd.revents & POLLRDHUP;

WaitEventSetWaitBlock's defined(WAIT_USE_POLL) branch uses the following check to find WL_SOCKET_CLOSED

#ifdef POLLRDHUP
if ((cur_event->events & WL_SOCKET_CLOSED) &&
(cur_pollfd->revents & (POLLRDHUP | errflags)))
{
/* remote peer closed, or error */
occurred_events->events |= WL_SOCKET_CLOSED;
}
#endif

Where errflags = POLLHUP | POLLERR | POLLNVAL;

So, should we also be using all these flags like: input_fd.events = POLLRDHUP | errflags;  ?


+ ereport(ERROR,
+ (errcode(ERRCODE_CONNECTION_FAILURE),
+ errmsg("could not connect to server \"%s\"",
+ server->servername),
+ errdetail("Socket close is detected."),
+ errhint("Please check the health of the server.")));

Is erroring out always necessary? Maybe we should just return true/false, and let the caller decide what to do? For example, you might want to rollback to savepoint and retry?  If the caller wants to rollback the whole transaction, that is possible anyway.

Maybe similar to postgres_fdw_disconnect(), we can warn if the connection is involved in a tx (e.g., depth > 0)? 

new file mode 100644
index 0000000000..3ce169c837
--- /dev/null
+++ b/contrib/postgres_fdw/expected/postgres_fdw_1.out
@@ -0,0 +1,11615 @@
+-- ===================================================================
+-- create FDW objects
+-- ===================================================================
+CREATE EXTENSION postgres_fdw;
+CREATE SERVER testserver1 FOREIGN DATA WRAPPER postgres_fdw;
+DO $d$
+    BEGIN

Do we really need this new file or just an oversight in your patch?

Thanks,
Onder KALACI

Software Engineer at Microsoft &

Developing the Citus database extension for PostgreSQL 

RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Önder,

Sorry for my late reply. I understood that we got agreement the basic design of first version. Thanks!
I attached new version patches.

> I think more experienced hackers could guide us here. I don't see a problem
> with that, but checking other functions like pqSocketPoll(), I see that
> Postgres uses the select system call. I wonder if we can have something
> similar with select? Seems no, but wanted to bring up in case you know more
> about that?

Here I have been still checking...

> I guess it kind of makes sense to have the flexibility to check all
> connections vs only in tx connections.
>
> Though, maybe we should follow a similar approach
> to postgres_fdw_disconnect(servername) / postgres_fdw_disconnect_all()
>     postgres_fdw_verify_connection_states(servername) /
> postgres_fdw_verify_connection_states_all()
>
> That seems like a more natural API considering the other UDFs. You can also
> use in combination with postgres_fdw_get_connections()

I think you are right, fixed.

> Hmm, I don't know the reason. Is that maybe epoll is available on a smaller
> number of platforms and libpq can be used on more platforms as being a
> client library?

Here I have been still checking...

> So, should we also be using all these flags like: input_fd.events =
> POLLRDHUP | *errflags*;  ?

Fixed.

> Is erroring out always necessary? Maybe we should just return true/false,
> and let the caller decide what to do? For example, you might want to
> rollback to savepoint and retry?  If the caller wants to rollback the whole
> transaction, that is possible anyway.

I changed as you suggested once, but I was not sure we can really do such a thing.
Please see the following case.

```
BEGIN;

SAVEPOINT s;
SELECT 1 FROM ft1 LIMIT 1; -- connect to foreign server

SELECT pg_terminate_backend(pid, 180000) FROM pg_stat_activity
    WHERE application_name = 'fdw_retry_check'; -- kill the remote connection

SELECT 1 FROM ft1 LIMIT 1; -- will fail
-> ERROR:  FATAL:  terminating connection due to administrator command


ROLLBACK TO SAVEPOINT s; -- OK, go back to savepoint...
COMMIT;
-> ERROR:  connection to server "loopback" was lost
```

The test means that even if we go back to the savepoint, 
we could not commit the transaction because postgres_fdw tries to connect to the remove
in pgfdw_xact_callback().

Do we really have a flexibility when the remote connection seems to be closed?
If we do not have, I will revert the change.

> Do we really need this new file or just an oversight in your patch?

Previously there were three file because sometimes the ordering of
output  was changed, but it was no more needed.
Currently this patch contains two comparison files, one is for linux,
and another one is for unsupported OSes like Windows.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
Andres Freund
Date:
Hi,

On 2022-11-25 11:41:45 +0000, Hayato Kuroda (Fujitsu) wrote:
> Sorry for my late reply. I understood that we got agreement the basic design of first version. Thanks!
> I attached new version patches.

This is failing on cfbot / CI, as have prior versions.

https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F40%2F3388


https://api.cirrus-ci.com/v1/artifact/task/6655254493134848/testrun/build/testrun/postgres_fdw/regress/regression.diffs

diff -U3 /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/postgres_fdw.out
/tmp/cirrus-ci-build/build/testrun/postgres_fdw/regress/results/postgres_fdw.out
--- /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/postgres_fdw.out    2022-12-06 05:21:33.906116000 +0000
+++ /tmp/cirrus-ci-build/build/testrun/postgres_fdw/regress/results/postgres_fdw.out    2022-12-06 05:27:24.929694000
+0000
@@ -11732,10 +11732,9 @@
 -- execute check function. This should return false on supported platforms,
 -- otherwise return true.
 SELECT postgres_fdw_verify_connection_states('loopback');
-WARNING:  08006
  postgres_fdw_verify_connection_states 
 ---------------------------------------
- f
+ t
 (1 row)
 
 ABORT;

The failure happens everywhere except on linux, so presumably there's some
portability issue in the patch.

I've set the patch as waiting on author for now.


Note that you can test CI in your own repository, as explained in src/tools/ci/README

Greetings,

Andres Freund



RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Andres,

> This is failing on cfbot / CI, as have prior versions.

Thank you for notifying to me. I attached newer versions.

> The failure happens everywhere except on linux, so presumably there's some
> portability issue in the patch.

Yes, you are right. This feature can be used only for linux, and
previous patches seemed to have bad files.

> Note that you can test CI in your own repository, as explained in
> src/tools/ci/README

Thank you for your information. Actually I was in trouble because
I don't have such machines for developing postgres.
I have confirmed that newer patches could pass the test CI.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear hackers,

PSA rebased patches.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
vignesh C
Date:
On Tue, 20 Dec 2022 at 07:22, Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear hackers,
>
> PSA rebased patches.

The patch does not apply on top of HEAD as in [1], please post a rebased patch:

=== Applying patches on top of PostgreSQL commit ID
92957ed98c5c565362ce665266132a7f08f6b0c0 ===
=== applying patch ./v21-0003-add-test.patch
patching file contrib/postgres_fdw/expected/postgres_fdw.out
Hunk #1 FAILED at 11701.
1 out of 1 hunk FAILED -- saving rejects to file
contrib/postgres_fdw/expected/postgres_fdw.out.rej
patching file contrib/postgres_fdw/expected/postgres_fdw_1.out
patching file contrib/postgres_fdw/sql/postgres_fdw.sql
Hunk #1 FAILED at 3876.
1 out of 1 hunk FAILED -- saving rejects to file
contrib/postgres_fdw/sql/postgres_fdw.sql.rej

[1] - http://cfbot.cputube.org/patch_41_3388.log

Regards,
Vignesh



RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Vignesh,

Thanks for reporting. PSA rebased version.
These can be applied work well on my HEAD(bd8d45).

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
Tom Lane
Date:
"Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> writes:
> Thanks for reporting. PSA rebased version.
> These can be applied work well on my HEAD(bd8d45).

I think that it's a really bad idea to require postgres_fdw.sql
to have two expected-files: that will be a maintenance nightmare.
Please put whatever it is that needs a variant expected-file
into its own, hopefully very small and seldom-changed, test script.
Or rethink whether you really need a test case that has
platform-dependent output.

            regards, tom lane



RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear tom,

> I think that it's a really bad idea to require postgres_fdw.sql
> to have two expected-files: that will be a maintenance nightmare.
> Please put whatever it is that needs a variant expected-file
> into its own, hopefully very small and seldom-changed, test script.
> Or rethink whether you really need a test case that has
> platform-dependent output.

Thank you for giving the suggestion. I agreed your saying and modifed that.

I added new functions on the libpq and postgres-fdw layer that check whether the
checking works well or not. In the test, at first, the platform is checked and
the checking function is called only when it is supported.

An alternative approach is that PQCanConncheck() can be combined with PQConncheck().
This can reduce the libpq function, but we must define another returned value to
the function like -2. I was not sure which approach was better.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
Ted Yu
Date:


On Tue, Jan 10, 2023 at 8:26 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
Dear tom,

> I think that it's a really bad idea to require postgres_fdw.sql
> to have two expected-files: that will be a maintenance nightmare.
> Please put whatever it is that needs a variant expected-file
> into its own, hopefully very small and seldom-changed, test script.
> Or rethink whether you really need a test case that has
> platform-dependent output.

Thank you for giving the suggestion. I agreed your saying and modifed that.

I added new functions on the libpq and postgres-fdw layer that check whether the
checking works well or not. In the test, at first, the platform is checked and
the checking function is called only when it is supported.

An alternative approach is that PQCanConncheck() can be combined with PQConncheck().
This can reduce the libpq function, but we must define another returned value to
the function like -2. I was not sure which approach was better.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Hi,

+       /* quick exit if connection cache has been not initialized yet. */

been not initialized -> not been initialized 

+                                               (errcode(ERRCODE_CONNECTION_FAILURE),
+                                                errmsg("could not connect to server \"%s\"",

Currently each server which is not connected would log a warning.
Is it better to concatenate names for such servers and log one line ? This would be cleaner when there are multiple such servers.

Cheers

RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Ted,

Thank you for reviewing! PSA new version.

> +       /* quick exit if connection cache has been not initialized yet. */
>
> been not initialized -> not been initialized

Fixed.

> +                                               (errcode(ERRCODE_CONNECTION_FAILURE),
> +                                                errmsg("could not connect to server \"%s\"",
>
> Currently each server which is not connected would log a warning.
> Is it better to concatenate names for such servers and log one line ? This would be cleaner when there are multiple
suchservers.
 

Sounds good, fixed as you said. The following shows the case that two disconnections
are detected by postgres_fdw_verify_connection_states_all().

```
postgres=*# select postgres_fdw_verify_connection_states_all ();
WARNING:  could not connect to server "my_external_server2", "my_external_server"
DETAIL:  Socket close is detected.
HINT:  Plsease check the health of server.
 postgres_fdw_verify_connection_states_all 
-------------------------------------------
 f
(1 row)
```

Currently, the name of servers is concatenated without doing unique checks. IIUC
a backend process cannot connect to the same foreign server by using different
user mapping, so there is no possibility that the same name appears twice.
If the user mapping is altered in the transaction, the cache entry is invalidated
and will not be checked.


Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear hackers,

I was not sure, but the cfbot could not be accepted the previous version.
I made the patch again from HEAD(5f6401) without any changes,
so I did not count up the version number.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
Katsuragi Yuta
Date:
On 2023-01-11 19:04, Hayato Kuroda (Fujitsu) wrote:
> Dear hackers,
> 
> I was not sure, but the cfbot could not be accepted the previous 
> version.
> I made the patch again from HEAD(5f6401) without any changes,
> so I did not count up the version number.
> 
> Best Regards,
> Hayato Kuroda
> FUJITSU LIMITED

Hi,

Thanks for the patch!
I read the patch v24. These are my comments. Please check.

## v24-0001-Add-PQConncheck-and-PQCanConncheck-to-libpq.patch
+    <varlistentry id="libpq-PQCanConncheck">
+     
<term><function>PQCanConncheck</function><indexterm><primary>PQCanConncheck</primary></indexterm></term>
+     <listitem>
+      <para>
+       Returns the status of the socket.

Is this description right? I think this description is for
PQConncheck. Something like "Checks whether PQConncheck is
available on this platform." seems better.


+/* Check whether the postgres server is still alive or not */
+extern int PQConncheck(PGconn *conn);
+extern int PQCanConncheck(void);

Should the names of these functions be in the form of PQconnCheck?
Not PQConncheck (c.f. The section of fe-misc.c in libpq-fe.h).


## v24-0002-postgres_fdw-add-postgres_fdw_verify_connection_.patch
+PG_FUNCTION_INFO_V1(postgres_fdw_verify_connection_states);
+PG_FUNCTION_INFO_V1(postgres_fdw_verify_connection_states_all);
+PG_FUNCTION_INFO_V1(postgres_fdw_can_verify_connection_states);

This patch adds new functions to postgres_fdw for PostgreSQL 16.
So, I think it is necessary to update the version of postgres_fdw (v1.1 
to v1.2).


+    <term><function>postgres_fdw_verify_connection_states_all() returns 
boolean</function></term>
+    <listitem>
+     <para>
+      This function checks the status of remote connections that are 
established
+      by <filename>postgres_fdw</filename> from the local session to 
the foreign
+      servers. This check is performed by polling the socket and allows

It seems better to add a description that states this function
checks all the connections. Like the description of
postgres_fdw_disconnect_all(). For example, "This function
checks the status of 'all the' remote connections..."?

regards,

-- 
Katsuragi Yuta
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Katsuragi-san,

Thank you for reviewing! PSA new patch set.

> ## v24-0001-Add-PQConncheck-and-PQCanConncheck-to-libpq.patch
> +    <varlistentry id="libpq-PQCanConncheck">
> +
> <term><function>PQCanConncheck</function><indexterm><primary>PQCan
> Conncheck</primary></indexterm></term>
> +     <listitem>
> +      <para>
> +       Returns the status of the socket.
> 
> Is this description right? I think this description is for
> PQConncheck. Something like "Checks whether PQConncheck is
> available on this platform." seems better.

It might be copy-and-paste error. Thanks for reporting.
According to other parts, the sentence should be started like "Returns ...".
So I followed the style and did cosmetic change.

> +/* Check whether the postgres server is still alive or not */
> +extern int PQConncheck(PGconn *conn);
> +extern int PQCanConncheck(void);
> 
> Should the names of these functions be in the form of PQconnCheck?
> Not PQConncheck (c.f. The section of fe-misc.c in libpq-fe.h).

Agreed, fixed.

> ## v24-0002-postgres_fdw-add-postgres_fdw_verify_connection_.patch
> +PG_FUNCTION_INFO_V1(postgres_fdw_verify_connection_states);
> +PG_FUNCTION_INFO_V1(postgres_fdw_verify_connection_states_all);
> +PG_FUNCTION_INFO_V1(postgres_fdw_can_verify_connection_states);
> 
> This patch adds new functions to postgres_fdw for PostgreSQL 16.
> So, I think it is necessary to update the version of postgres_fdw (v1.1
> to v1.2).

I checked postgres_fdw commit log, and it seemed that the version was
updated when SQL functions are added. Fixed.

> +    <term><function>postgres_fdw_verify_connection_states_all() returns
> boolean</function></term>
> +    <listitem>
> +     <para>
> +      This function checks the status of remote connections that are
> established
> +      by <filename>postgres_fdw</filename> from the local session to
> the foreign
> +      servers. This check is performed by polling the socket and allows
> 
> It seems better to add a description that states this function
> checks all the connections. Like the description of
> postgres_fdw_disconnect_all(). For example, "This function
> checks the status of 'all the' remote connections..."?

I checked the docs and fixed. Moreover, some inconsistent statements were
also fixed.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
Ted Yu
Date:


On Sat, Jan 21, 2023 at 4:03 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
Dear Katsuragi-san,

Thank you for reviewing! PSA new patch set.

> ## v24-0001-Add-PQConncheck-and-PQCanConncheck-to-libpq.patch
> +    <varlistentry id="libpq-PQCanConncheck">
> +
> <term><function>PQCanConncheck</function><indexterm><primary>PQCan
> Conncheck</primary></indexterm></term>
> +     <listitem>
> +      <para>
> +       Returns the status of the socket.
>
> Is this description right? I think this description is for
> PQConncheck. Something like "Checks whether PQConncheck is
> available on this platform." seems better.

It might be copy-and-paste error. Thanks for reporting.
According to other parts, the sentence should be started like "Returns ...".
So I followed the style and did cosmetic change.

> +/* Check whether the postgres server is still alive or not */
> +extern int PQConncheck(PGconn *conn);
> +extern int PQCanConncheck(void);
>
> Should the names of these functions be in the form of PQconnCheck?
> Not PQConncheck (c.f. The section of fe-misc.c in libpq-fe.h).

Agreed, fixed.

> ## v24-0002-postgres_fdw-add-postgres_fdw_verify_connection_.patch
> +PG_FUNCTION_INFO_V1(postgres_fdw_verify_connection_states);
> +PG_FUNCTION_INFO_V1(postgres_fdw_verify_connection_states_all);
> +PG_FUNCTION_INFO_V1(postgres_fdw_can_verify_connection_states);
>
> This patch adds new functions to postgres_fdw for PostgreSQL 16.
> So, I think it is necessary to update the version of postgres_fdw (v1.1
> to v1.2).

I checked postgres_fdw commit log, and it seemed that the version was
updated when SQL functions are added. Fixed.

> +    <term><function>postgres_fdw_verify_connection_states_all() returns
> boolean</function></term>
> +    <listitem>
> +     <para>
> +      This function checks the status of remote connections that are
> established
> +      by <filename>postgres_fdw</filename> from the local session to
> the foreign
> +      servers. This check is performed by polling the socket and allows
>
> It seems better to add a description that states this function
> checks all the connections. Like the description of
> postgres_fdw_disconnect_all(). For example, "This function
> checks the status of 'all the' remote connections..."?

I checked the docs and fixed. Moreover, some inconsistent statements were
also fixed.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Hi,
For v25-0001-Add-PQConnCheck-and-PQCanConnCheck-to-libpq.patch , `pqConnCheck_internal` only has one caller which is quite short.
Can pqConnCheck_internal and PQConnCheck be merged into one func ?

+int
+PQCanConnCheck(void)

It seems the return value should be of bool type.

Cheers

RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
> Thank you for reviewing! PSA new patch set.

Sorry, I missed the updated file in the patch. New version will be posted soon.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Ted,

Thanks for reviewing! PSA new version.

> For v25-0001-Add-PQConnCheck-and-PQCanConnCheck-to-libpq.patch , `pqConnCheck_internal` only has one caller which is
quiteshort.
 
> Can pqConnCheck_internal and PQConnCheck be merged into one func ?

I divided the function for feature expandability. Currently it works on linux platform,
but the limitation should be removed in future and internal function will be longer.
Therefore I want to keep this style.

> +int
> +PQCanConnCheck(void)
>
> It seems the return value should be of bool type.

I slightly changed the returned value like true/false. But IIUC libpq functions
cannot define as "bool" datatype. E.g. PQconnectionNeedsPassword() returns true/false,
but the function is defined as int.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
Katsuragi Yuta
Date:
On 2023-01-23 14:40, Hayato Kuroda (Fujitsu) wrote:
> Dear Ted,
> 
> Thanks for reviewing! PSA new version.
> 
>> For v25-0001-Add-PQConnCheck-and-PQCanConnCheck-to-libpq.patch , 
>> `pqConnCheck_internal` only has one caller which is quite short.
>> Can pqConnCheck_internal and PQConnCheck be merged into one func ?
> 
> I divided the function for feature expandability. Currently it works
> on linux platform,
> but the limitation should be removed in future and internal function
> will be longer.
> Therefore I want to keep this style.
> 
>> +int
>> +PQCanConnCheck(void)
>> 
>> It seems the return value should be of bool type.
> 
> I slightly changed the returned value like true/false. But IIUC libpq 
> functions
> cannot define as "bool" datatype. E.g. PQconnectionNeedsPassword()
> returns true/false,
> but the function is defined as int.
> 
> Best Regards,
> Hayato Kuroda
> FUJITSU LIMITED

Thank you for updating the patch!

+/* Check whether the postgres server is still alive or not */
+extern int PQConnCheck(PGconn *conn);
+extern int PQCanConnCheck(void);

Aren't these PQconnCheck and PQcanConnCheck? I think the first letter
following 'PQ' should be lower case.

regards.

-- 
Katsuragi Yuta
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Katsuragi-san,

Thank you for reading the patch! PSA new version.

> Thank you for updating the patch!
> 
> +/* Check whether the postgres server is still alive or not */
> +extern int PQConnCheck(PGconn *conn);
> +extern int PQCanConnCheck(void);
> 
> Aren't these PQconnCheck and PQcanConnCheck? I think the first letter
> following 'PQ' should be lower case.

I cannot find any rules about it, but seems right. Fixed.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear hackers,

I have updated my patch for error handling and kqueue() support.
Actually I do not have BSD-like machine, but I developed by using github CICD.
I think at first we should focus on 0001-0003, and then work for 0004.

Followings are change notes and my analysis.

0001

* Fix missed replacements from PQConnCheck() to PQconnCheck().
* Error handling was improved. Now we can detect the failure of poll() and return -1 at that time.
* I thought we don't have to add select(2) in PQconnCheck(). According to man page [1],
  select(2) can be only used for watch whether the status is readable, writable, or exceptional condition.
  It means that select() does not have an event corresponding to POLLRDHUP.

0002, 0003
Not changed

0004

* Add kqueue(2) support() for BSD family.
* I did not add epoll() support, because it can be used only on linux and such systems have POLLRDHUP instead. 
 checked other codes in libpq, and they do not use epoll(). We can see that such an event does not specified in POSIX
[2]
 and it can be used for linux only. I decided to use poll() as much as possible to keep the policy.
* While coding, I found that there are no good timing to close the kernel event queue.
 It means that the lifetime of kqueue becomes same as the client program and occupy the small memory forever.
 I'm not sure it can be accepted.


[1]: https://man7.org/linux/man-pages/man2/select.2.html
[2]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/poll.html

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
I found cfbot failure, PSA fixed version.
Sorry for noise.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
Katsuragi Yuta
Date:
On 2023-01-27 15:57, Hayato Kuroda (Fujitsu) wrote:
> I found cfbot failure, PSA fixed version.
> Sorry for noise.
> 
> Best Regards,
> Hayato Kuroda
> FUJITSU LIMITED

Hi Kuroda-san,

Thank you for updating the patch! Sorry for the late reply.

0001:
+       while (result < 0 && errno == EINTR);
+
+       if (result < 0)
+                       return -1;

this `return -1` is not indented properly.


0002:
+    <term><function>postgres_fdw_verify_connection_states(server_name 
text) returns boolean</function></term>
...
+      extension to the <symbol>poll</symbol> system call, including 
Linux. This
+      returns <literal>true</literal> if checked connections are still 
valid,
+      or the checking is not supported on this platform. 
<literal>false</literal>
+      is returned if the local session seems to be disconnected from 
other
+      servers. Example usage of the function:

Here, 'still valid' seems a little bit confusing because this 'valid' is 
not
the same as postgres_fdw_get_connections's 'valid' [1].

Should 'still valid' be 'existing connection is not closed by the remote 
peer'?
But this description does not cover all the cases where this function 
returns true...
I think one choice is to write all the cases like 'returns true if any 
of the
following condition is satisfied. 1) existing connection is not closed 
by the
remote peer 2) there is no connection for specified server yet 3) the 
checking
is not supported...'. If my understanding is not correct, please point 
it out.

BTW, is it reasonable to return true if ConnectionHash is not 
initialized or
there is no ConnCacheEntry for specified remote server? What do you 
think
about returning NULL in that case?


0003:
I think it is better that the test covers all the new functions.
How about adding a test for postgres_fdw_verify_connection_states_all?


+-- ===================================================================
+-- test for postgres_fdw_verify_foreign_servers function
+-- ===================================================================

Writing all the functions' name like 'test for 
postgres_fdw_verify_connection_states
and postgres_fdw_can_verify_connection_states' looks straightforward.
What do you think about this?


0004:
Sorry, I have not read 0004. I will read.


[1]: 
https://github.com/postgres/postgres/blob/master/doc/src/sgml/postgres-fdw.sgml#L764-L765

regards,

-- 
Katsuragi Yuta
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: [Proposal] Add foreign-server health checks infrastructure

From
Kyotaro Horiguchi
Date:
At Fri, 27 Jan 2023 06:57:01 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in 
> I found cfbot failure, PSA fixed version.

+       Unlike <xref linkend="libpq-PQstatus"/>, this function checks socket
+       health. This check is performed by polling the socket. This function is
+       currently available only on systems that support the non-standard
+       <symbol>POLLRDHUP</symbol> extension to the <symbol>poll</symbol> system

I find it quite confusing that we have pqSocketCheck and PQconnCheck,
that does almost the same thing.. Since pqSocketCheck is a static
function, we can modify the function as we like.

I still don't understand why we need pqconnCheck_internal separate
from pqSocketPoll(), and PQconnCheck from pqSocketCheck.


https://www.postgresql.org/message-id/flat/TYAPR01MB58665BF23D38EDF10028DE2AF5299%40TYAPR01MB5866.jpnprd01.prod.outlook.com#47d21431bf9fa94f763c824f6e81fa54
> IIUC, pqSocketCheck () calls pqSocketPoll(),
> and in the pqSocketPoll() we poll()'d the POLLIN or POLLOUT event.
> But according to [1], we must wait POLLRDHUP event,
> so we cannot reuse it straightforward.

Yeah, I didn't suggest to use the function as-is. Couldn't we extend
the fucntion by letting it accept end_time = 0 && !forRead &&
!forWrite, not causing side effects?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Horiguchi-san,

Thank you for checking! The patch will be attached to another mail.

> At Fri, 27 Jan 2023 06:57:01 +0000, "Hayato Kuroda (Fujitsu)"
> <kuroda.hayato@fujitsu.com> wrote in
> > I found cfbot failure, PSA fixed version.
>
> +       Unlike <xref linkend="libpq-PQstatus"/>, this function checks socket
> +       health. This check is performed by polling the socket. This function is
> +       currently available only on systems that support the non-standard
> +       <symbol>POLLRDHUP</symbol> extension to the
> <symbol>poll</symbol> system
>
> I find it quite confusing that we have pqSocketCheck and PQconnCheck,
> that does almost the same thing.. Since pqSocketCheck is a static
> function, we can modify the function as we like.

Renamed to pqSocketIsReadableOrWritableOrValid(), but seemed very bad...

> I still don't understand why we need pqconnCheck_internal separate
> from pqSocketPoll(), and PQconnCheck from pqSocketCheck.

pqconnCheck_internal() was combined into pqSocketPoll(), but PQconnCheck() still
exists. libpq-fe.h, did not include standard header files except stdio.h. I'm not
sure whether we can add an inclusion of time.h, because it may break the compatibility
that some platform may not have the header. If there are not such a system, we may
able to export pqSocketCheck() and remove PQconnCheck().

The side effect of this changes is that codes become dirty when we add kqueue() support...

> https://www.postgresql.org/message-id/flat/TYAPR01MB58665BF23D38EDF1
> 0028DE2AF5299%40TYAPR01MB5866.jpnprd01.prod.outlook.com#47d21431bf9
> fa94f763c824f6e81fa54
> > IIUC, pqSocketCheck () calls pqSocketPoll(),
> > and in the pqSocketPoll() we poll()'d the POLLIN or POLLOUT event.
> > But according to [1], we must wait POLLRDHUP event,
> > so we cannot reuse it straightforward.
>
> Yeah, I didn't suggest to use the function as-is. Couldn't we extend
> the fucntion by letting it accept end_time = 0 && !forRead &&
> !forWrite, not causing side effects?

Modified accordingly. Is it what you expected?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED




RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Katsuragi-san,

Thank you for reviewing! PSA new version patches.

> 0001:
> +       while (result < 0 && errno == EINTR);
> +
> +       if (result < 0)
> +                       return -1;
> 
> this `return -1` is not indented properly.

This part is no longer needed. Please see another discussion[1].

> 0002:
> +    <term><function>postgres_fdw_verify_connection_states(server_name
> text) returns boolean</function></term>
> ...
> +      extension to the <symbol>poll</symbol> system call, including
> Linux. This
> +      returns <literal>true</literal> if checked connections are still
> valid,
> +      or the checking is not supported on this platform.
> <literal>false</literal>
> +      is returned if the local session seems to be disconnected from
> other
> +      servers. Example usage of the function:
> 
> Here, 'still valid' seems a little bit confusing because this 'valid' is
> not
> the same as postgres_fdw_get_connections's 'valid' [1].
> 
> Should 'still valid' be 'existing connection is not closed by the remote
> peer'?
> But this description does not cover all the cases where this function
> returns true...
> I think one choice is to write all the cases like 'returns true if any
> of the
> following condition is satisfied. 1) existing connection is not closed
> by the
> remote peer 2) there is no connection for specified server yet 3) the
> checking
> is not supported...'. If my understanding is not correct, please point
> it out.

Modified like you pointed out.

> BTW, is it reasonable to return true if ConnectionHash is not
> initialized or
> there is no ConnCacheEntry for specified remote server? What do you
> think
> about returning NULL in that case?

I'm not sure which one is better, but modified accordingly.

> 0003:
> I think it is better that the test covers all the new functions.
> How about adding a test for postgres_fdw_verify_connection_states_all?
> 
> 
> +--
> ===================================================
> ================
> +-- test for postgres_fdw_verify_foreign_servers function
> +--
> ===================================================
> ================
> 
> Writing all the functions' name like 'test for
> postgres_fdw_verify_connection_states
> and postgres_fdw_can_verify_connection_states' looks straightforward.
> What do you think about this?

Added.

> 0004:
> Sorry, I have not read 0004. I will read.

No problem:-).

[1]:
https://www.postgresql.org/message-id/TYAPR01MB58664E039F45959AB321FA1FF5D99%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
Katsuragi Yuta
Date:
On 2023-02-09 23:39, Hayato Kuroda (Fujitsu) wrote:
> Dear Katsuragi-san,
> 
> Thank you for reviewing! PSA new version patches.

Thank you for updating the patch! These are my comments,
please check.

0001:
Extending pqSocketPoll seems to be a better way because we can
avoid having multiple similar functions. I also would like to hear
horiguchi-san's opinion whether this matches his expectation.
Improvements of pqSocketPoll/pqSocketCheck is discussed in this
thread[1]. I'm concerned with the discussion.

As for the function's name, what do you think about keeping
current name (pqSocketCheck)? pqSocketIsReadable... describes
the functionality very well though.

pqConnCheck seems to be a family of pqReadReady or pqWriteRedy,
so how about placing pqConnCheck below them?

+ * Moreover, when neither forRead nor forWrite is requested and timeout 
is
+ * disabled, try to check the health of socket.
Isn't it better to put the comment on how the arguments are
interpreted before the description of return value?

+#if defined(POLLRDHUP)
+            input_fd.events = POLLRDHUP | POLLHUP | POLLNVAL;
...
+    input_fd.events |= POLLERR;
To my understanding, POLLHUP, POLLNVAL and POLLERR are ignored
in event. Are they necessary?


0002:
As for the return value of postgres_fdw_verify_connection_states,
what do you think about returning NULL when connection-checking
is not performed? I think there are two cases 1) ConnectionHash
is not initialized or 2) connection is not found for specified
server name, That is, no entry passes the first if statement below
(case 2)).

```
          if (all || entry->serverid == serverid)
          {
              if (PQconnCheck(entry->conn))
              {
```

0004:
I'm wondering if we should add kqueue support in this version,
because adding kqueue support introduces additional things to
be considered. What do you think about focusing on the main
functionality using poll in this patch and going for kqueue
support after this patch?


[1]: 
https://www.postgresql.org/message-id/20230209.115009.2229702014236187289.horikyota.ntt%40gmail.com

regards,

-- 
Katsuragi Yuta
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: [Proposal] Add foreign-server health checks infrastructure

From
Peter Smith
Date:
Here is a code review only for patch v31-0001.

======
General Comment

1.
PQcanConnCheck seemed like a strange API name. Maybe it can have the
same prefix as the other?

e.g.

- PQconnCheck()
- PGconnCheckSupported()

or

- PQconnCheck()
- PGconnCheckable()

======
Commit Message

2.
PqconnCheck() function allows to check the status of socket by polling
the socket. This function is currently available only on systems that
support the non-standard POLLRDHUP extension to the poll system call,
including Linux.

PqcanConnCheck() checks whether above function is available or not.

~

2a.
"status of socket" --> "status of the connection"

~

2b.
"above function" --> "the above function"

======
doc/src/sgml/libpq.sgml

3. PQconnCheck

Returns the health of the socket.

int PQconnCheck(PGconn *conn);

Unlike PQstatus, this function checks socket health. This check is
performed by polling the socket. This function is currently available
only on systems that support the non-standard POLLRDHUP extension to
the poll system call, including Linux. PQconnCheck returns greater
than zero if the remote peer seems to be closed, returns 0 if the
socket is valid, and returns -1 if the connection has been already
invalid or an error is error occurred.

~

3a.
Should these descriptions be referring to the health of the
*connection* rather than the health of the socket?

~

3b.
"has been already invalid" ?? wording

~~~

4. PQcanConnCheck

Returns whether PQconnCheck is available on this platform.
PQcanConnCheck returns 1 if the function is supported, otherwise
returns 0.

~

I thought this should be worded using "true" and "false" same as other
boolean functions on this page.

SUGGESTION
Returns true (1) or false (0) to indicate if the PQconnCheck function
is supported on this platform.

======
src/interfaces/libpq/fe-misc.c

5.
-static int pqSocketCheck(PGconn *conn, int forRead, int forWrite,
-   time_t end_time);
+static int pqSocketIsReadableOrWritableOrValid(PGconn *conn, int forRead,
+ int forWrite, time_t end_time);

I was not 100% sure overloading this API is the right thing to do.
Doesn't this introduce a subtle side-effect on some of the existing
callers? e.g. Previously pqWaitTimed would ALWAYS return 0 if
forRead/forWrite were both false. But now other return values like
errors will be possible. Is that OK?

~~~

6. pqSocketPoll

/*
 * Check a file descriptor for read and/or write data, possibly waiting.
 * If neither forRead nor forWrite are set, immediately return a timeout
 * condition (without waiting).  Return >0 if condition is met, 0
 * if a timeout occurred, -1 if an error or interrupt occurred.
 *
 * Timeout is infinite if end_time is -1.  Timeout is immediate (no blocking)
 * if end_time is 0 (or indeed, any time before now).
 *
 * Moreover, when neither forRead nor forWrite is requested and timeout is
 * disabled, try to check the health of socket.
 */


The new comment "Moreover..." is contrary to the earlier part of the
same comment which already said, "If neither forRead nor forWrite are
set, immediately return a timeout condition (without waiting)."

There might be side-effects to previous/existing callers of this
function (e.g. pqWaitTimed via pqSocketCheck)

~~~

7.
  if (!forRead && !forWrite)
- return 0;
+ {
+ /* Try to check the health if requested */
+ if (end_time == 0)
+#if defined(POLLRDHUP)
+ input_fd.events = POLLRDHUP | POLLHUP | POLLNVAL;
+#else
+ return 0;
+#endif /* defined(POLLRDHUP) */
+ else
+ return 0;
+ }

FYI - I think the new code can be simpler without needing #else by
calling your other new function.

SUGGESTION
if (!forRead && !forWrite)
{
if (!PQcanConnCheck() || end_time != 0)
return 0;

/* Check the connection health when end_time is 0 */
Assert(PQcanConnCheck() && end_time == 0);
#if defined(POLLRDHUP)
input_fd.events = POLLRDHUP | POLLHUP | POLLNVAL;
#endif
}

~~~

8. PQconnCheck
+/*
+ * Check whether PQconnCheck() can work well on this platform.
+ *
+ * Returns 1 if this can use PQconnCheck(), otherwise 0.
+ */
+int
+PQcanConnCheck(void)
+{
+#if (defined(HAVE_POLL) && defined(POLLRDHUP))
+ return true;
+#else
+ return false;
+#endif
+}

~

8a.
"can work well" --> "works"

~

8b.
Maybe better to say "true (1)" and "otherwise false (0)"

------
Kind Regards,
Peter Smith.
Fujitsu Australia



RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Katsuragi-san,

Thank you for reviewing! PSA new version.

> 0001:
> Extending pqSocketPoll seems to be a better way because we can
> avoid having multiple similar functions. I also would like to hear
> horiguchi-san's opinion whether this matches his expectation.
> Improvements of pqSocketPoll/pqSocketCheck is discussed in this
> thread[1]. I'm concerned with the discussion.

I checked the thread and seems correct. I can post +1 to the thread.
And the modification will be automatically reflected to the feature
if we use the same function, I thought.

> As for the function's name, what do you think about keeping
> current name (pqSocketCheck)? pqSocketIsReadable... describes
> the functionality very well though.

No objection, I can keep the shorter name.

> pqConnCheck seems to be a family of pqReadReady or pqWriteRedy,
> so how about placing pqConnCheck below them?

Moved.

> + * Moreover, when neither forRead nor forWrite is requested and timeout
> is
> + * disabled, try to check the health of socket.
> Isn't it better to put the comment on how the arguments are
> interpreted before the description of return value?
> 
> +#if defined(POLLRDHUP)
> +            input_fd.events = POLLRDHUP | POLLHUP |
> POLLNVAL;
> ...
> +    input_fd.events |= POLLERR;
> To my understanding, POLLHUP, POLLNVAL and POLLERR are ignored
> in event. Are they necessary?

I read man poll(3) again, and I found that these event is ignored when
it sets to the events attribute. So removed.

> 0002:
> As for the return value of postgres_fdw_verify_connection_states,
> what do you think about returning NULL when connection-checking
> is not performed? I think there are two cases 1) ConnectionHash
> is not initialized or 2) connection is not found for specified
> server name, That is, no entry passes the first if statement below
> (case 2)).
> 
> ```
>           if (all || entry->serverid == serverid)
>           {
>               if (PQconnCheck(entry->conn))
>               {
> ```

I think in that case we can follow postgres_fdw_disconnect().
About postgres_fdw_disconnect(), if the given server_name does not exist,
an error is reported. This is a current behavior so I want to keep it.
Besides, I added the description to document.

> 0004:
> I'm wondering if we should add kqueue support in this version,
> because adding kqueue support introduces additional things to
> be considered. What do you think about focusing on the main
> functionality using poll in this patch and going for kqueue
> support after this patch?

I think it is better because it can keep patches smaller. So I stopped attaching 0004.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Peter,

Thank you for reviewing! Latest version can be seen in [1].

> 1.
> PQcanConnCheck seemed like a strange API name. Maybe it can have the
> same prefix as the other?
> 
> e.g.
> 
> - PQconnCheck()
> - PGconnCheckSupported()
> 
> or
> 
> - PQconnCheck()
> - PGconnCheckable()
>

I choose PQconnCheckable().

> 2.
> PqconnCheck() function allows to check the status of socket by polling
> the socket. This function is currently available only on systems that
> support the non-standard POLLRDHUP extension to the poll system call,
> including Linux.
> 
> PqcanConnCheck() checks whether above function is available or not.
> 
> ~
> 
> 2a.
> "status of socket" --> "status of the connection"
> 
> ~
> 
> 2b.
> "above function" --> "the above function"

Fixed.

> doc/src/sgml/libpq.sgml
> 
> 3. PQconnCheck
> 
> Returns the health of the socket.
> 
> int PQconnCheck(PGconn *conn);
> 
> Unlike PQstatus, this function checks socket health. This check is
> performed by polling the socket. This function is currently available
> only on systems that support the non-standard POLLRDHUP extension to
> the poll system call, including Linux. PQconnCheck returns greater
> than zero if the remote peer seems to be closed, returns 0 if the
> socket is valid, and returns -1 if the connection has been already
> invalid or an error is error occurred.
> 
> ~
> 
> 3a.
> Should these descriptions be referring to the health of the
> *connection* rather than the health of the socket?

Reworded.

> 3b.
> "has been already invalid" ?? wording

I checked codes and found that the socket becomes PGINVALID_SOCKET
after being closed. So I clarified that.

> 4. PQcanConnCheck
> 
> Returns whether PQconnCheck is available on this platform.
> PQcanConnCheck returns 1 if the function is supported, otherwise
> returns 0.
> 
> ~
> 
> I thought this should be worded using "true" and "false" same as other
> boolean functions on this page.
> 
> SUGGESTION
> Returns true (1) or false (0) to indicate if the PQconnCheck function
> is supported on this platform.

Fixed.

> ======
> src/interfaces/libpq/fe-misc.c
> 
> 5.
> -static int pqSocketCheck(PGconn *conn, int forRead, int forWrite,
> -   time_t end_time);
> +static int pqSocketIsReadableOrWritableOrValid(PGconn *conn, int forRead,
> + int forWrite, time_t end_time);
> 
> I was not 100% sure overloading this API is the right thing to do.
> Doesn't this introduce a subtle side-effect on some of the existing
> callers? e.g. Previously pqWaitTimed would ALWAYS return 0 if
> forRead/forWrite were both false. But now other return values like
> errors will be possible. Is that OK?

I checked pqWaitTimed() and seems not to affect other parts.

As the first place, it is an internal function and will be never called from clients.
There are 2 functions that call that - connectDBComplete() and pqWait(), and both of
them are not set finish_time to zero. In connectDBComplete(), basically finish_time 
is set to -1,  and set to time(NULL) + connect_timeout if the timeout is enabled.


> 6. pqSocketPoll
> 
> /*
>  * Check a file descriptor for read and/or write data, possibly waiting.
>  * If neither forRead nor forWrite are set, immediately return a timeout
>  * condition (without waiting).  Return >0 if condition is met, 0
>  * if a timeout occurred, -1 if an error or interrupt occurred.
>  *
>  * Timeout is infinite if end_time is -1.  Timeout is immediate (no blocking)
>  * if end_time is 0 (or indeed, any time before now).
>  *
>  * Moreover, when neither forRead nor forWrite is requested and timeout is
>  * disabled, try to check the health of socket.
>  */
> 
> 
> The new comment "Moreover..." is contrary to the earlier part of the
> same comment which already said, "If neither forRead nor forWrite are
> set, immediately return a timeout condition (without waiting)."
> 
> There might be side-effects to previous/existing callers of this
> function (e.g. pqWaitTimed via pqSocketCheck)

Comments were fixed. About the side-effect, please see previous discussion.

> 7.
>   if (!forRead && !forWrite)
> - return 0;
> + {
> + /* Try to check the health if requested */
> + if (end_time == 0)
> +#if defined(POLLRDHUP)
> + input_fd.events = POLLRDHUP | POLLHUP | POLLNVAL;
> +#else
> + return 0;
> +#endif /* defined(POLLRDHUP) */
> + else
> + return 0;
> + }
> 
> FYI - I think the new code can be simpler without needing #else by
> calling your other new function.
> 
> SUGGESTION
> if (!forRead && !forWrite)
> {
> if (!PQcanConnCheck() || end_time != 0)
> return 0;
> 
> /* Check the connection health when end_time is 0 */
> Assert(PQcanConnCheck() && end_time == 0);
> #if defined(POLLRDHUP)
> input_fd.events = POLLRDHUP | POLLHUP | POLLNVAL;
> #endif
> }

Fixed.

> 8. PQconnCheck
> +/*
> + * Check whether PQconnCheck() can work well on this platform.
> + *
> + * Returns 1 if this can use PQconnCheck(), otherwise 0.
> + */
> +int
> +PQcanConnCheck(void)
> +{
> +#if (defined(HAVE_POLL) && defined(POLLRDHUP))
> + return true;
> +#else
> + return false;
> +#endif
> +}
> 
> ~
> 
> 8a.
> "can work well" --> "works"
> 
> ~
> 
> 8b.
> Maybe better to say "true (1)" and "otherwise false (0)"

Fixed.

[1]:
https://www.postgresql.org/message-id/TYAPR01MB58669C95604A0EB7BCC1B58CF5A49%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Re: [Proposal] Add foreign-server health checks infrastructure

From
Peter Smith
Date:
Here are some review comments for v32-0001.

======
Commit message

1.
PQconnCheck() function allows to check the status of the socket by polling
the socket. This function is currently available only on systems that
support the non-standard POLLRDHUP extension to the poll system call,
including Linux.

~

(missed fix from previous review)

"status of the socket" --> "status of the connection"

====
doc/src/sgml/libpq.sgml

2. PQconnCheck
+      <para>
+       This function check the health of the connection. Unlike <xref
linkend="libpq-PQstatus"/>,
+       this check is performed by polling the corresponding socket. This
+       function is currently available only on systems that support the
+       non-standard <symbol>POLLRDHUP</symbol> extension to the
<symbol>poll</symbol>
+       system call, including Linux. <xref linkend="libpq-PQconnCheck"/>
+       returns greater than zero if the remote peer seems to be closed, returns
+       <literal>0</literal> if the socket is valid, and returns
<literal>-1</literal>
+       if the connection has already been closed or an error has occurred.
+      </para>

"check the health" --> "checks the health"

~~~

3. PQcanConnCheck

+      <para>
+       Returns true (1) or false (0) to indicate if the PQconnCheck function
+       is supported on this platform.

Should the reference to PQconnCheck be a link as it previously was?

======
src/interfaces/libpq/fe-misc.c

4. PQconnCheck

+/*
+ * Check whether the socket peer closed connection or not.
+ *
+ * Returns >0 if remote peer seems to be closed, 0 if it is valid,
+ * -1 if the input connection is bad or an error occurred.
+ */
+int
+PQconnCheck(PGconn *conn)
+{
+ return pqSocketCheck(conn, 0, 0, (time_t) 0);
+}

I'm confused. This comment says =0 means connection is valid. But the
pqSocketCheck comment says =0 means it timed out.

So those two function comments don't seem compatible

~~~

5. PQconnCheckable

+/*
+ * Check whether PQconnCheck() works well on this platform.
+ *
+ * Returns true (1) if this can use PQconnCheck(), otherwise false (0).
+ */
+int
+PQconnCheckable(void)
+{
+#if (defined(HAVE_POLL) && defined(POLLRDHUP))
+ return true;
+#else
+ return false;
+#endif
+}

Why say "works well"? IMO it either works or doesn't work – there is no "well".

SUGGESTION1
Check whether PQconnCheck() works on this platform.

SUGGESTION2
Check whether PQconnCheck() can work on this platform.

~~~

6. pqSocketCheck

 /*
  * Checks a socket, using poll or select, for data to be read, written,
- * or both.  Returns >0 if one or more conditions are met, 0 if it timed
+ * or both. Moreover, when neither forRead nor forWrite is requested and
+ * timeout is disabled, try to check the health of socket.
+ *
+ * Returns >0 if one or more conditions are met, 0 if it timed
  * out, -1 if an error occurred.
  *
  * If SSL is in use, the SSL buffer is checked prior to checking the socket

~

See review comment #4. (e.g. This says =0 if it timed out).

~~~

7. pqSocketPoll

+ * When neither forRead nor forWrite are set and timeout is disabled,
+ *
+ * - If the timeout is disabled, try to check the health of the socket
+ * - Otherwise this immediately returns 0
+ *
+ * Return >0 if condition is met, 0 if a timeout occurred, -1 if an error
+ * or interrupt occurred.

Don't say "and timeout is disabled," because it clashes with the 1st
bullet which also says "- If the timeout is disabled,".

------
Kind Regards,
Peter Smith.
Fujitsu Australia



Re: [Proposal] Add foreign-server health checks infrastructure

From
Katsuragi Yuta
Date:
Hi Kuroda-san,

Thank you for updating the patch!

On 2023-02-20 15:42, Hayato Kuroda (Fujitsu) wrote:
> Dear Katsuragi-san,
> 
> Thank you for reviewing! PSA new version.

I rethought the pqSocketPoll part. Current interpretation of
arguments seems a little bit confusing because a specific pattern
of arguments has a different meaning. What do you think about
introducing a new argument like `int forConnCheck`? This seems
straightforward and readable.


>> 0002:
>> As for the return value of postgres_fdw_verify_connection_states,
>> what do you think about returning NULL when connection-checking
>> is not performed? I think there are two cases 1) ConnectionHash
>> is not initialized or 2) connection is not found for specified
>> server name, That is, no entry passes the first if statement below
>> (case 2)).
>> 
>> ```
>>           if (all || entry->serverid == serverid)
>>           {
>>               if (PQconnCheck(entry->conn))
>>               {
>> ```
> 
> I think in that case we can follow postgres_fdw_disconnect().
> About postgres_fdw_disconnect(), if the given server_name does not 
> exist,
> an error is reported.

Yes, I think this error is fine.
I think there are cases where the given server name does exist,
however the connection check is not performed (does not pass the
first if statement above). 1) a case where the server name exist,
however an open connection to the specified server is not found.
2) case where connection for specified server is invalidated.
3) case where there is not ConnectionHash entry for the specified
server. Current implementation returns true in that case, however
the check is not performed. Suppose the checking mechanism is
supported on the platform, it does not seem reasonable to return
true or false when the check is not performed. What do you think?

regards,

-- 
Katsuragi Yuta
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Peter,

Thank you for reviewing! PSA new version.

> 1.
> PQconnCheck() function allows to check the status of the socket by polling
> the socket. This function is currently available only on systems that
> support the non-standard POLLRDHUP extension to the poll system call,
> including Linux.
> 
> ~
> 
> (missed fix from previous review)
> 
> "status of the socket" --> "status of the connection"

Sorry, fixed.

> ====
> doc/src/sgml/libpq.sgml
> 
> 2. PQconnCheck
> +      <para>
> +       This function check the health of the connection. Unlike <xref
> linkend="libpq-PQstatus"/>,
> +       this check is performed by polling the corresponding socket. This
> +       function is currently available only on systems that support the
> +       non-standard <symbol>POLLRDHUP</symbol> extension to the
> <symbol>poll</symbol>
> +       system call, including Linux. <xref linkend="libpq-PQconnCheck"/>
> +       returns greater than zero if the remote peer seems to be closed, returns
> +       <literal>0</literal> if the socket is valid, and returns
> <literal>-1</literal>
> +       if the connection has already been closed or an error has occurred.
> +      </para>
> 
> "check the health" --> "checks the health"

Fixed.

> 3. PQcanConnCheck
> 
> +      <para>
> +       Returns true (1) or false (0) to indicate if the PQconnCheck function
> +       is supported on this platform.
> 
> Should the reference to PQconnCheck be a link as it previously was?

Right, fixed.

> src/interfaces/libpq/fe-misc.c
> 
> 4. PQconnCheck
> 
> +/*
> + * Check whether the socket peer closed connection or not.
> + *
> + * Returns >0 if remote peer seems to be closed, 0 if it is valid,
> + * -1 if the input connection is bad or an error occurred.
> + */
> +int
> +PQconnCheck(PGconn *conn)
> +{
> + return pqSocketCheck(conn, 0, 0, (time_t) 0);
> +}
> 
> I'm confused. This comment says =0 means connection is valid. But the
> pqSocketCheck comment says =0 means it timed out.
> 
> So those two function comments don't seem compatible

Added further descriptions atop pqSocketCheck() and pqSocketPoll().
Is it helpful to understand?

> 5. PQconnCheckable
> 
> +/*
> + * Check whether PQconnCheck() works well on this platform.
> + *
> + * Returns true (1) if this can use PQconnCheck(), otherwise false (0).
> + */
> +int
> +PQconnCheckable(void)
> +{
> +#if (defined(HAVE_POLL) && defined(POLLRDHUP))
> + return true;
> +#else
> + return false;
> +#endif
> +}
> 
> Why say "works well"? IMO it either works or doesn't work – there is no "well".
> 
> SUGGESTION1
> Check whether PQconnCheck() works on this platform.
> 
> SUGGESTION2
> Check whether PQconnCheck() can work on this platform.

I choose 2.

> 6. pqSocketCheck
> 
>  /*
>   * Checks a socket, using poll or select, for data to be read, written,
> - * or both.  Returns >0 if one or more conditions are met, 0 if it timed
> + * or both. Moreover, when neither forRead nor forWrite is requested and
> + * timeout is disabled, try to check the health of socket.
> + *
> + * Returns >0 if one or more conditions are met, 0 if it timed
>   * out, -1 if an error occurred.
>   *
>   * If SSL is in use, the SSL buffer is checked prior to checking the socket
> 
> ~
> 
> See review comment #4. (e.g. This says =0 if it timed out).

Descriptions were added.

> 7. pqSocketPoll
> 
> + * When neither forRead nor forWrite are set and timeout is disabled,
> + *
> + * - If the timeout is disabled, try to check the health of the socket
> + * - Otherwise this immediately returns 0
> + *
> + * Return >0 if condition is met, 0 if a timeout occurred, -1 if an error
> + * or interrupt occurred.
> 
> Don't say "and timeout is disabled," because it clashes with the 1st
> bullet which also says "- If the timeout is disabled,".

This comments were reworded.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Katsuragi-san,

Thank you for reviewing! New patch set can be available on [1].

> I rethought the pqSocketPoll part. Current interpretation of
> arguments seems a little bit confusing because a specific pattern
> of arguments has a different meaning. What do you think about
> introducing a new argument like `int forConnCheck`? This seems
> straightforward and readable.

I think it may be better, so fixed.
But now we must consider another thing - will we support combination of conncheck
and {read|write} or timeout? Especially about timeout, if someone calls pqSocketPoll()
with forConnCheck = 1 and end_time = -1, the process may be stuck because it waits
till socket peer closed connection.
Currently the forConnCheck can be specified with other request, but timeout must be zero.

> I think there are cases where the given server name does exist,
> however the connection check is not performed (does not pass the
> first if statement above). 1) a case where the server name exist,
> however an open connection to the specified server is not found.
> 2) case where connection for specified server is invalidated.
> 3) case where there is not ConnectionHash entry for the specified
> server. Current implementation returns true in that case, however
> the check is not performed. Suppose the checking mechanism is
> supported on the platform, it does not seem reasonable to return
> true or false when the check is not performed. What do you think?

Thank you for detailed explanation. I agreed your opinion and modified like that.

While making the patch, I come up with idea that postgres_fdw_verify_connection_states*
returns NULL if PQconnCheck() cannot work on this platform. This can be done by
adding PQconnCheckable(). It may be reasonable because the checking is not really
done on this environment. How do you think?

[1]:
https://www.postgresql.org/message-id/TYAPR01MB586664F5ED2128E572EA95B0F5AA9%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Re: [Proposal] Add foreign-server health checks infrastructure

From
Katsuragi Yuta
Date:
Hi Kuroda-san,

Thank you for updating the patch!


>> I rethought the pqSocketPoll part. Current interpretation of
>> arguments seems a little bit confusing because a specific pattern
>> of arguments has a different meaning. What do you think about
>> introducing a new argument like `int forConnCheck`? This seems
>> straightforward and readable.
> 
> I think it may be better, so fixed.
> But now we must consider another thing - will we support combination
> of conncheck
> and {read|write} or timeout? Especially about timeout, if someone
> calls pqSocketPoll()
> with forConnCheck = 1 and end_time = -1, the process may be stuck
> because it waits
> till socket peer closed connection.
> Currently the forConnCheck can be specified with other request, but
> timeout must be zero.

Yes, we need to consider these.
I'm wondering whether we need a special care for the combination
of event and timeout. Surely, if forConnCheck is set and end_time = -1,
pqSocketPoll blocks until the connection close. However, the
behavior matches the meaning of the arguments and does not seem
confusing (also not an error state). Do we need to restrict this
kind of usage in the pqSocketPoll side? I think like the following
might be fine.

```
      if (!forRead && !forWrite)
    {
        if (!(forConnCheck && PQconnCheckable()))
            return 0;
    }
```



> While making the patch, I come up with idea that
> postgres_fdw_verify_connection_states*
> returns NULL if PQconnCheck() cannot work on this platform. This can be 
> done by
> adding PQconnCheckable(). It may be reasonable because the checking is
> not really
> done on this environment. How do you think?

I agree with you.

Followings are comments for v33. Please check.
0001:
1. the comment of PQconnCheck
+/*
+ * Check whether the socket peer closed connection or not.
+ *

Check whether the socket peer closed 'the' connection or not?

2. the comment of pqSocketCheck
- * or both.  Returns >0 if one or more conditions are met, 0 if it 
timed
- * out, -1 if an error occurred.
+ * or both. Moreover, this function can check the health of socket on 
some
+ * limited platforms if end_time is 0.

the health of socket -> the health of the connection?
if end_time is 0 -> if forConnCehck is specified?


3. the comment of pqSocketPoll
- * If neither forRead nor forWrite are set, immediately return a 
timeout
- * condition (without waiting).  Return >0 if condition is met, 0
- * if a timeout occurred, -1 if an error or interrupt occurred.
+ * Moreover, this function can check the health of socket on some 
limited
+ * platforms if end_time is 0.

the health of socket -> the health of the connection?
if end_time is 0 -> if forConnCehck is specified?

4. the code of pqSocketPoll
+#if defined(POLLRDHUP)
+    if (forConnCheck)
+        input_fd.events |= POLLRDHUP;
+#endif

I think it is better to use PQconnCheckable() to remove the macro.

5. the code of pqSocketCheck and the definition of pqSocketPoll
-        result = pqSocketPoll(conn->sock, forRead, forWrite, end_time);
+        result = pqSocketPoll(conn->sock, forRead, forWrite, forConnCheck, 
end_time);

-pqSocketPoll(int sock, int forRead, int forWrite, time_t end_time)
+pqSocketPoll(int sock, int forRead, int forWrite, int forConnCheck, 
time_t end_time)

Should these be divided into two lines?


6. the comment of verify_cached_connections
+ * This function emits warnings if a disconnection is found, and this 
returns
+ * false only when the lastly verified server seems to be disconnected.

It seems better to write the case where this function returns
true.

7. the comment of postgres_fdw_verify_connection_states
+ * This function emits a warning if a disconnection is found. This 
returns
+ * false only when the verified server seems to be disconnected, and 
reutrns
+ * NULL if the connection check had not been done.

It seems better to write the case where this function returns
true.

8. the code of
+                (errcode(ERRCODE_CONNECTION_FAILURE),
+                 errmsg("%s", str.data),
+                 errdetail("Socket close is detected."),
+                 errhint("Plsease check the health of server.")));

Is it better to use "Connection close is detected" rather than
"Socket close is detected"?

9. the document of postgres_fdw
The document of postgres_fdw_verify_connection_states_* is a little
bit old. Could you update it?

regards,

-- 
Katsuragi Yuta
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Hi Katsuragi-san,

Thank you for reviewing! PSA new version.

> >> I rethought the pqSocketPoll part. Current interpretation of
> >> arguments seems a little bit confusing because a specific pattern
> >> of arguments has a different meaning. What do you think about
> >> introducing a new argument like `int forConnCheck`? This seems
> >> straightforward and readable.
> >
> > I think it may be better, so fixed.
> > But now we must consider another thing - will we support combination
> > of conncheck
> > and {read|write} or timeout? Especially about timeout, if someone
> > calls pqSocketPoll()
> > with forConnCheck = 1 and end_time = -1, the process may be stuck
> > because it waits
> > till socket peer closed connection.
> > Currently the forConnCheck can be specified with other request, but
> > timeout must be zero.
> 
> Yes, we need to consider these.
> I'm wondering whether we need a special care for the combination
> of event and timeout. Surely, if forConnCheck is set and end_time = -1,
> pqSocketPoll blocks until the connection close. However, the
> behavior matches the meaning of the arguments and does not seem
> confusing (also not an error state). Do we need to restrict this
> kind of usage in the pqSocketPoll side? I think like the following
> might be fine.
> 
> ```
>       if (!forRead && !forWrite)
>     {
>         if (!(forConnCheck && PQconnCheckable()))
>             return 0;
>     }
> ```

Seems right, I was too pessimistic. Fixed.

> > While making the patch, I come up with idea that
> > postgres_fdw_verify_connection_states*
> > returns NULL if PQconnCheck() cannot work on this platform. This can be
> > done by
> > adding PQconnCheckable(). It may be reasonable because the checking is
> > not really
> > done on this environment. How do you think?
> 
> I agree with you.

Changed.

> Followings are comments for v33. Please check.
> 0001:
> 1. the comment of PQconnCheck
> +/*
> + * Check whether the socket peer closed connection or not.
> + *
> 
> Check whether the socket peer closed 'the' connection or not?

Changed.

> 2. the comment of pqSocketCheck
> - * or both.  Returns >0 if one or more conditions are met, 0 if it
> timed
> - * out, -1 if an error occurred.
> + * or both. Moreover, this function can check the health of socket on
> some
> + * limited platforms if end_time is 0.
> 
> the health of socket -> the health of the connection?
> if end_time is 0 -> if forConnCehck is specified?

Fixed.

> 3. the comment of pqSocketPoll
> - * If neither forRead nor forWrite are set, immediately return a
> timeout
> - * condition (without waiting).  Return >0 if condition is met, 0
> - * if a timeout occurred, -1 if an error or interrupt occurred.
> + * Moreover, this function can check the health of socket on some
> limited
> + * platforms if end_time is 0.
> 
> the health of socket -> the health of the connection?
> if end_time is 0 -> if forConnCehck is specified?

Fixed.

> 4. the code of pqSocketPoll
> +#if defined(POLLRDHUP)
> +    if (forConnCheck)
> +        input_fd.events |= POLLRDHUP;
> +#endif
> 
> I think it is better to use PQconnCheckable() to remove the macro.

IIUC the macro is needed. In FreeBSD, macOS and other platforms do not have the
macro POLLRDHUP so they cannot compile. I checked by my CI and got following error.

```
...
FAILED: src/interfaces/libpq/libpq.a.p/fe-misc.c.o 
...
../src/interfaces/libpq/fe-misc.c:1149:22: error: use of undeclared identifier 'POLLRDHUP'
                input_fd.events |= POLLRDHUP;
````

It must be invisible from them.

> 5. the code of pqSocketCheck and the definition of pqSocketPoll
> -        result = pqSocketPoll(conn->sock, forRead, forWrite,
> end_time);
> +        result = pqSocketPoll(conn->sock, forRead, forWrite,
> forConnCheck,
> end_time);
> 
> -pqSocketPoll(int sock, int forRead, int forWrite, time_t end_time)
> +pqSocketPoll(int sock, int forRead, int forWrite, int forConnCheck,
> time_t end_time)
> 
> Should these be divided into two lines?

pgindent did not say anything about them, but I divided. 

> 6. the comment of verify_cached_connections
> + * This function emits warnings if a disconnection is found, and this
> returns
> + * false only when the lastly verified server seems to be disconnected.

Updated. I think comments were not correct, so these were also fixed.

> It seems better to write the case where this function returns
> true.
> 7. the comment of postgres_fdw_verify_connection_states
> + * This function emits a warning if a disconnection is found. This
> returns
> + * false only when the verified server seems to be disconnected, and
> reutrns
> + * NULL if the connection check had not been done.
> 
> It seems better to write the case where this function returns
> true.


> 8. the code of
> +                (errcode(ERRCODE_CONNECTION_FAILURE),
> +                 errmsg("%s", str.data),
> +                 errdetail("Socket close is detected."),
> +                 errhint("Plsease check the health of
> server.")));
> 
> Is it better to use "Connection close is detected" rather than
> "Socket close is detected"?

Changed.

> 9. the document of postgres_fdw
> The document of postgres_fdw_verify_connection_states_* is a little
> bit old. Could you update it?

Updated. IIUC postgres_fdw_verify_connection_states returns

* true, if the connection is verified.
* false, if the connection seems to be disconnected.
* NULL, if this is not the supported platform or connection has not been established.

And postgres_fdw_verify_connection_states_all returns

* true if all the connections are verified.
* false, if one of connections seems to be disconnected.
* NULL, if this is not the supported platform or this backend has never established connections

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
Katsuragi Yuta
Date:
Hi Kuroda-san,

Thank you for updating the patch!

>> 4. the code of pqSocketPoll
>> +#if defined(POLLRDHUP)
>> +    if (forConnCheck)
>> +        input_fd.events |= POLLRDHUP;
>> +#endif
>> 
>> I think it is better to use PQconnCheckable() to remove the macro.
> 
> IIUC the macro is needed. In FreeBSD, macOS and other platforms do not 
> have the
> macro POLLRDHUP so they cannot compile. I checked by my CI and got
> following error.
> 
> ```
> ...
> FAILED: src/interfaces/libpq/libpq.a.p/fe-misc.c.o
> ...
> ../src/interfaces/libpq/fe-misc.c:1149:22: error: use of undeclared
> identifier 'POLLRDHUP'
>                 input_fd.events |= POLLRDHUP;
> ````
> 
> It must be invisible from them.

Sorry, my mistake...


>> 9. the document of postgres_fdw
>> The document of postgres_fdw_verify_connection_states_* is a little
>> bit old. Could you update it?
> 
> Updated. IIUC postgres_fdw_verify_connection_states returns
> 
> * true, if the connection is verified.
> * false, if the connection seems to be disconnected.
> * NULL, if this is not the supported platform or connection has not
> been established.
> 
> And postgres_fdw_verify_connection_states_all returns
> 
> * true if all the connections are verified.
> * false, if one of connections seems to be disconnected.
> * NULL, if this is not the supported platform or this backend has
> never established connections

I think 'backend has never established connections' is a little bit 
strong.
I think the following cases also return NULL. The case where a
connection was established, however the connection is now closed
by the postgres_fdw_disconnect() or something. NULL is also returned if
the connection is invalidated. So, I think 'NULL, if no valid
connection is established or the function is not supported on
the platform.' is better. What do you think?


Followings are my comments for v34. Please check.

0001:
1. the document of pqConnCheck
+       <literal>0</literal> if the socket is valid, and returns 
<literal>-1</literal>
+       if the connection has already been closed or an error has 
occurred.

1.1 if the socket is valid -> returns 0 if the 'connection is not 
closed'?

1.2 returns -1 if the connection has already been closed  <- Let me ask 
a question.
Isn't this a situation where we would like to check using this
function? Is 'error has occurred' insufficient?

2. the comment of pqSocketCheck
+ * means that the socket has not matched POLLRDHUP event and the socket 
has
+ * still survived.

socket has still survived -> connection is not closed by the socket 
peer?

3. the comment of pqSocketPoll
+ * returned and forConnCheck is requested, it means that the socket has 
not
+ * matched POLLRDHUP event and the socket has still survived.

socket has still survived -> connection is not closed by the socket 
peer?

0002:
4. the comment of verify_cached_connections
+ * This function emits warnings if a disconnection is found. This 
return true
+ * if disconnections cannot be found, otherwise return false.

return ture -> return's' true
return false -> return's' false

5. the comment of postgres_fdw_verify_connection_states
+ * if the local session seems to be disconnected from other servers. 
NULL is
+ * returned if a connection to the specified foreign server has not 
been
+ * established yet, or this function is not available on this platform.

Considering the above discussion, 'NULL is returned if a valid
connection to the specified foreign server is not established or
this function...' seems better. What do you think?

6. the document of postgres_fdw_verify_connection_states
  <literal>NULL</literal>
+      is returned if a connection to the specified foreign server has 
not been
+      established yet, or this function is not available on this 
platform

The same as comment no.5.

7. the document of postgres_fdw_verify_connection_states_all
<literal>NULL</literal>
+      is returned if the local session does not have connection caches, 
or this
+      function is not available on this platform.

I think there is a case where a connection cache exists but valid
connections do not exist and NULL is returned (disconnection case).
Almost the same document as the postgres_fdw_verify_connection_states
case (comment no.5) seems better.


8. the document of postgres_fdw_can_verify_connection_states
+      This function checks whether 
<function>postgres_fdw_verify_connection_states</function>
+      and <function>postgres_fdw_verify_connection_states</function> 
work well

Should the latter (or former) postgres_fdw_verify_connection_states be
postgres_fdw_verify_connection_states_all?


regards,

-- 
Katsuragi Yuta
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Katsuragi-san,

Thank you for reviewing! PSA new version.

> >> 4. the code of pqSocketPoll
> >> +#if defined(POLLRDHUP)
> >> +    if (forConnCheck)
> >> +        input_fd.events |= POLLRDHUP;
> >> +#endif
> >>
> >> I think it is better to use PQconnCheckable() to remove the macro.
> >
> > IIUC the macro is needed. In FreeBSD, macOS and other platforms do not
> > have the
> > macro POLLRDHUP so they cannot compile. I checked by my CI and got
> > following error.
> >
> > ```
> > ...
> > FAILED: src/interfaces/libpq/libpq.a.p/fe-misc.c.o
> > ...
> > ../src/interfaces/libpq/fe-misc.c:1149:22: error: use of undeclared
> > identifier 'POLLRDHUP'
> >                 input_fd.events |= POLLRDHUP;
> > ````
> >
> > It must be invisible from them.
> 
> Sorry, my mistake...

No issues :-).

> >> 9. the document of postgres_fdw
> >> The document of postgres_fdw_verify_connection_states_* is a little
> >> bit old. Could you update it?
> >
> > Updated. IIUC postgres_fdw_verify_connection_states returns
> >
> > * true, if the connection is verified.
> > * false, if the connection seems to be disconnected.
> > * NULL, if this is not the supported platform or connection has not
> > been established.
> >
> > And postgres_fdw_verify_connection_states_all returns
> >
> > * true if all the connections are verified.
> > * false, if one of connections seems to be disconnected.
> > * NULL, if this is not the supported platform or this backend has
> > never established connections
> 
> I think 'backend has never established connections' is a little bit
> strong.
> I think the following cases also return NULL. The case where a
> connection was established, however the connection is now closed
> by the postgres_fdw_disconnect() or something. NULL is also returned if
> the connection is invalidated. So, I think 'NULL, if no valid
> connection is established or the function is not supported on
> the platform.' is better. What do you think?

Disconnect functions have never been in my mind. Descriptions must be updated.

> Followings are my comments for v34. Please check.
> 
> 0001:
> 1. the document of pqConnCheck
> +       <literal>0</literal> if the socket is valid, and returns
> <literal>-1</literal>
> +       if the connection has already been closed or an error has
> occurred.
> 
> 1.1 if the socket is valid -> returns 0 if the 'connection is not
> closed'?

Fixed.

> 1.2 returns -1 if the connection has already been closed  <- Let me ask
> a question.
> Isn't this a situation where we would like to check using this
> function? Is 'error has occurred' insufficient?

Seems right, fixed.

> 2. the comment of pqSocketCheck
> + * means that the socket has not matched POLLRDHUP event and the socket
> has
> + * still survived.
> 
> socket has still survived -> connection is not closed by the socket
> peer?

Fixed.

> 3. the comment of pqSocketPoll
> + * returned and forConnCheck is requested, it means that the socket has
> not
> + * matched POLLRDHUP event and the socket has still survived.
> 
> socket has still survived -> connection is not closed by the socket
> peer?

Fixed.

> 0002:
> 4. the comment of verify_cached_connections
> + * This function emits warnings if a disconnection is found. This
> return true
> + * if disconnections cannot be found, otherwise return false.
> 
> return ture -> return's' true
> return false -> return's' false

Fixed.

> 5. the comment of postgres_fdw_verify_connection_states
> + * if the local session seems to be disconnected from other servers.
> NULL is
> + * returned if a connection to the specified foreign server has not
> been
> + * established yet, or this function is not available on this platform.
> 
> Considering the above discussion, 'NULL is returned if a valid
> connection to the specified foreign server is not established or
> this function...' seems better. What do you think?

Right, fixed.

> 6. the document of postgres_fdw_verify_connection_states
>   <literal>NULL</literal>
> +      is returned if a connection to the specified foreign server has
> not been
> +      established yet, or this function is not available on this
> platform
> 
> The same as comment no.5.

Right, fixed.

> 7. the document of postgres_fdw_verify_connection_states_all
> <literal>NULL</literal>
> +      is returned if the local session does not have connection caches,
> or this
> +      function is not available on this platform.
> 
> I think there is a case where a connection cache exists but valid
> connections do not exist and NULL is returned (disconnection case).
> Almost the same document as the postgres_fdw_verify_connection_states
> case (comment no.5) seems better.

Yes, but completely same statement cannot be used because these is not
specified foreign server. How about: 
NULL is returned if there are no established connections or this function ...

> 8. the document of postgres_fdw_can_verify_connection_states
> +      This function checks whether
> <function>postgres_fdw_verify_connection_states</function>
> +      and <function>postgres_fdw_verify_connection_states</function>
> work well
> 
> Should the latter (or former) postgres_fdw_verify_connection_states be
> postgres_fdw_verify_connection_states_all?

That was copy-and-paste error, fixed.

> regards,
> 
> --
> Katsuragi Yuta
> Advanced Computing Technology Center
> Research and Development Headquarters
> NTT DATA CORPORATION

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
 

Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
Katsuragi Yuta
Date:
Hi Kuroda-san,

Thank you for updating the patch!

I think we can update the status to ready for committer after
this fix, if there is no objection.

>> 7. the document of postgres_fdw_verify_connection_states_all
>> <literal>NULL</literal>
>> +      is returned if the local session does not have connection 
>> caches,
>> or this
>> +      function is not available on this platform.
>> 
>> I think there is a case where a connection cache exists but valid
>> connections do not exist and NULL is returned (disconnection case).
>> Almost the same document as the postgres_fdw_verify_connection_states
>> case (comment no.5) seems better.
> 
> Yes, but completely same statement cannot be used because these is not
> specified foreign server. How about:
> NULL is returned if there are no established connections or this 
> function ...

Yes, to align with the postgres_fdw_verify_connection_states()
case, how about writing the connection is not valid. Like the
following?
'NULL is returned if no valid connections are established or
this function...'


This is my comment for v35. Please check.
0002:
1. the comment of verify_cached_connections (I missed one minor point.)
+ * This function emits warnings if a disconnection is found. This 
returns true
+ * if disconnections cannot be found, otherwise returns false.

I think false is returned only if disconnections are found and
true is returned in all other cases. So, modifying the description
like the following seems better.
'This returns false if disconnections are found, otherwise
returns true.'


regards,

-- 
Katsuragi Yuta
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Katsuragi-san,

Thank you for reviewing! PSA new version patches.

> I think we can update the status to ready for committer after
> this fix, if there is no objection.

That's a very good news for me! How about other people?

> >> 7. the document of postgres_fdw_verify_connection_states_all
> >> <literal>NULL</literal>
> >> +      is returned if the local session does not have connection
> >> caches,
> >> or this
> >> +      function is not available on this platform.
> >>
> >> I think there is a case where a connection cache exists but valid
> >> connections do not exist and NULL is returned (disconnection case).
> >> Almost the same document as the postgres_fdw_verify_connection_states
> >> case (comment no.5) seems better.
> >
> > Yes, but completely same statement cannot be used because these is not
> > specified foreign server. How about:
> > NULL is returned if there are no established connections or this
> > function ...
> 
> Yes, to align with the postgres_fdw_verify_connection_states()
> case, how about writing the connection is not valid. Like the
> following?
> 'NULL is returned if no valid connections are established or
> this function...'

Prefer yours, fixed.

> This is my comment for v35. Please check.
> 0002:
> 1. the comment of verify_cached_connections (I missed one minor point.)
> + * This function emits warnings if a disconnection is found. This
> returns true
> + * if disconnections cannot be found, otherwise returns false.
> 
> I think false is returned only if disconnections are found and
> true is returned in all other cases. So, modifying the description
> like the following seems better.
> 'This returns false if disconnections are found, otherwise
> returns true.'

Fixed.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
vignesh C
Date:
On Tue, 7 Mar 2023 at 09:53, Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Katsuragi-san,
>
> Thank you for reviewing! PSA new version patches.
>
> > I think we can update the status to ready for committer after
> > this fix, if there is no objection.
>
> That's a very good news for me! How about other people?
>
> > >> 7. the document of postgres_fdw_verify_connection_states_all
> > >> <literal>NULL</literal>
> > >> +      is returned if the local session does not have connection
> > >> caches,
> > >> or this
> > >> +      function is not available on this platform.
> > >>
> > >> I think there is a case where a connection cache exists but valid
> > >> connections do not exist and NULL is returned (disconnection case).
> > >> Almost the same document as the postgres_fdw_verify_connection_states
> > >> case (comment no.5) seems better.
> > >
> > > Yes, but completely same statement cannot be used because these is not
> > > specified foreign server. How about:
> > > NULL is returned if there are no established connections or this
> > > function ...
> >
> > Yes, to align with the postgres_fdw_verify_connection_states()
> > case, how about writing the connection is not valid. Like the
> > following?
> > 'NULL is returned if no valid connections are established or
> > this function...'
>
> Prefer yours, fixed.
>
> > This is my comment for v35. Please check.
> > 0002:
> > 1. the comment of verify_cached_connections (I missed one minor point.)
> > + * This function emits warnings if a disconnection is found. This
> > returns true
> > + * if disconnections cannot be found, otherwise returns false.
> >
> > I think false is returned only if disconnections are found and
> > true is returned in all other cases. So, modifying the description
> > like the following seems better.
> > 'This returns false if disconnections are found, otherwise
> > returns true.'
>
> Fixed.

Few comments:
1) There is no handling of forConnCheck in #else HAVE_POLL, if this is
intentional we could add some comments for the same:
 static int
-pqSocketPoll(int sock, int forRead, int forWrite, time_t end_time)
+pqSocketPoll(int sock, int forRead,
+                        int forWrite, int forConnCheck, time_t end_time)
 {
        /* We use poll(2) if available, otherwise select(2) */
 #ifdef HAVE_POLL
@@ -1092,7 +1133,11 @@ pqSocketPoll(int sock, int forRead, int
forWrite, time_t end_time)
        int                     timeout_ms;

        if (!forRead && !forWrite)
-               return 0;

2) Can this condition be added to the parent if condition:
        if (!forRead && !forWrite)
-               return 0;
+       {
+               /* Connection check can be available on some limted platforms */
+               if (!(forConnCheck && PQconnCheckable()))
+                       return 0;
+       }

3) Can we add a comment for PQconnCheckable:
+/* Check whether the postgres server is still alive or not */
+extern int PQconnCheck(PGconn *conn);
+extern int PQconnCheckable(void);
+

4) "Note that if 0 is returned and forConnCheck is requested" doesn't
sound right, it can be changed to "Note that if 0 is returned when
forConnCheck is requested"
+ * If neither forRead, forWrite nor forConnCheck are set, immediately return a
+ * timeout condition (without waiting). Return >0 if condition is met, 0 if a
+ * timeout occurred, -1 if an error or interrupt occurred. Note that if 0 is
+ * returned and forConnCheck is requested, it means that the socket has not
+ * matched POLLRDHUP event and the connection is not closed by the socket peer.

Regards,
Vignesh



RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Vignesh,

Thank you for reviewing! PSA new version.

> 
> Few comments:
> 1) There is no handling of forConnCheck in #else HAVE_POLL, if this is
> intentional we could add some comments for the same:
>  static int
> -pqSocketPoll(int sock, int forRead, int forWrite, time_t end_time)
> +pqSocketPoll(int sock, int forRead,
> +                        int forWrite, int forConnCheck, time_t end_time)
>  {
>         /* We use poll(2) if available, otherwise select(2) */
>  #ifdef HAVE_POLL
> @@ -1092,7 +1133,11 @@ pqSocketPoll(int sock, int forRead, int
> forWrite, time_t end_time)
>         int                     timeout_ms;
> 
>         if (!forRead && !forWrite)
> -               return 0;

Comments were added. This missing is intentional, because with the patch present
we do not implement checking feature for kqueue(). If you want to check the
detailed implementation of that, please see 0004 patch attached on [1].

> 2) Can this condition be added to the parent if condition:
>         if (!forRead && !forWrite)
> -               return 0;
> +       {
> +               /* Connection check can be available on some limted platforms
> */
> +               if (!(forConnCheck && PQconnCheckable()))
> +                       return 0;
> +       }

Changed, and added comments atop the if-statement.

> 3) Can we add a comment for PQconnCheckable:
> +/* Check whether the postgres server is still alive or not */
> +extern int PQconnCheck(PGconn *conn);
> +extern int PQconnCheckable(void);
> +

Added. And I found the tab should be used to divide data type and name, so fixed.

> 4) "Note that if 0 is returned and forConnCheck is requested" doesn't
> sound right, it can be changed to "Note that if 0 is returned when
> forConnCheck is requested"
> + * If neither forRead, forWrite nor forConnCheck are set, immediately return a
> + * timeout condition (without waiting). Return >0 if condition is met, 0 if a
> + * timeout occurred, -1 if an error or interrupt occurred. Note that if 0 is
> + * returned and forConnCheck is requested, it means that the socket has not
> + * matched POLLRDHUP event and the connection is not closed by the socket
> peer.

Fixed.

[1]:
https://www.postgresql.org/message-id/TYAPR01MB58669EAAC02493BFF9F39B06F5D99%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
Katsuragi Yuta
Date:
On 2023-03-08 13:40, Hayato Kuroda (Fujitsu) wrote:
> Dear Vignesh,
> 
> Thank you for reviewing! PSA new version.

Hi,

Thank you for the comments, Vignesh.
Thank you for updating the patch, Kuroda-san. This fix looks fine to me.

And also, there seems no other comments from this post [1].
So, I'm planning to update the status to ready for committer
on next Monday.


[1]: 
https://www.postgresql.org/message-id/TYAPR01MB5866F8EA3C06E1B43E42E4E4F5B79%40TYAPR01MB5866.jpnprd01.prod.outlook.com

regards,

-- 
Katsuragi Yuta
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: [Proposal] Add foreign-server health checks infrastructure

From
Katsuragi Yuta
Date:
On 2023-03-10 18:07, Katsuragi Yuta wrote:
> On 2023-03-08 13:40, Hayato Kuroda (Fujitsu) wrote:
>> Dear Vignesh,
>> 
>> Thank you for reviewing! PSA new version.
> 
> Hi,
> 
> Thank you for the comments, Vignesh.
> Thank you for updating the patch, Kuroda-san. This fix looks fine to 
> me.
> 
> And also, there seems no other comments from this post [1].
> So, I'm planning to update the status to ready for committer
> on next Monday.
> 
> 
> [1]:
>
https://www.postgresql.org/message-id/TYAPR01MB5866F8EA3C06E1B43E42E4E4F5B79%40TYAPR01MB5866.jpnprd01.prod.outlook.com
> 
> regards,

Hi,

I updated the status of the patch to ready for committer.

regards,

-- 
Katsuragi Yuta
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Katsuragi-san,

> Hi,
> 
> I updated the status of the patch to ready for committer.
> 
> regards,

Thank you so much for your reviewing!
Now we can wait comments from senior members and committers.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Re: [Proposal] Add foreign-server health checks infrastructure

From
Fujii Masao
Date:

On 2023/03/13 16:05, Hayato Kuroda (Fujitsu) wrote:
> Thank you so much for your reviewing!
> Now we can wait comments from senior members and committers.

Thanks for working on this patch!

Regarding 0001 patch, on second thought, to me, it seems odd to expose
a function that doesn't have anything to directly do with PostgreSQL
as a libpq function. The function simply calls poll() on the socket
with POLLRDHUP if it is supported. While it is certainly convenient to
have this function, I'm not sure that it fits within the scope of libpq.
Thought?

Regarding 0002 patch, the behavior of postgres_fdw_verify_connection_states()
in regards to multiple connections using different user mappings seems
not well documented. The function seems to return false if either of
those connections has been closed.

This behavior means that it's difficult to identify which connection
has been closed when there are multiple ones to the given server.
To make it easier to identify that, it could be helpful to extend
the postgres_fdw_verify_connection_states() function so that it accepts
a unique connection as an input instead of just the server name.
One suggestion is to extend the function so that it accepts
both the server name and the user name used for the connection,
and checks the specified connection. If only the server name is specified,
the function should check all connections to the server and return false
if any of them are closed. This would be helpful since there is typically
only one connection to the server in most cases.

Additionally, it would be helpful to extend the postgres_fdw_get_connections()
function to also return the user name used for each connection,
as currently there is no straightforward way to identify it.

The function name "postgres_fdw_verify_connection_states" may seem
unnecessarily long to me. A simpler name like
"postgres_fdw_verify_connection" may be enough?

The patch may not be ready for commit due to the review comments,
and with the feature freeze approaching in a few days,
it may not be possible to include this feature in v16.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: [Proposal] Add foreign-server health checks infrastructure

From
Tom Lane
Date:
Fujii Masao <masao.fujii@oss.nttdata.com> writes:
> Regarding 0001 patch, on second thought, to me, it seems odd to expose
> a function that doesn't have anything to directly do with PostgreSQL
> as a libpq function. The function simply calls poll() on the socket
> with POLLRDHUP if it is supported. While it is certainly convenient to
> have this function, I'm not sure that it fits within the scope of libpq.
> Thought?

Yeah, that does not seem great, partly because the semantics would be
platform-dependent.  I don't think we want that to become part of
libpq's API.

            regards, tom lane



RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Fujii-san, Tom,

Thank you for giving a suggestion! PSA new version.

> Regarding 0001 patch, on second thought, to me, it seems odd to expose
> a function that doesn't have anything to directly do with PostgreSQL
> as a libpq function. The function simply calls poll() on the socket
> with POLLRDHUP if it is supported. While it is certainly convenient to
> have this function, I'm not sure that it fits within the scope of libpq.
> Thought?

Current style is motivated by Onder [1] and later discussions. I thought it might
be useful for other developers, but OK, I can remove changes on libpq modules.
Horiguchi-san has suggested [2] that it might be overkill to use WaitEventSet()
mechanism, so I kept using poll().
I reused the same naming as previous version because they actually do something
Like libpq, but better naming is very welcome.

> Regarding 0002 patch, the behavior of postgres_fdw_verify_connection_states()
> in regards to multiple connections using different user mappings seems
> not well documented. The function seems to return false if either of
> those connections has been closed.

I did not considered the situation because I have not came up with the situation
that only one of connections to the same foreign server is broken.

> This behavior means that it's difficult to identify which connection
> has been closed when there are multiple ones to the given server.
> To make it easier to identify that, it could be helpful to extend
> the postgres_fdw_verify_connection_states() function so that it accepts
> a unique connection as an input instead of just the server name.
> One suggestion is to extend the function so that it accepts
> both the server name and the user name used for the connection,
> and checks the specified connection. If only the server name is specified,
> the function should check all connections to the server and return false
> if any of them are closed. This would be helpful since there is typically
> only one connection to the server in most cases.

Just to confirm, your point "user name" means a local user, right?
I made a patch for addressing them.

> Additionally, it would be helpful to extend the postgres_fdw_get_connections()
> function to also return the user name used for each connection,
> as currently there is no straightforward way to identify it.

Added, See 0003. IIUC there is no good way to extract user mapping from its OID, so I have
added an function to do that and used it.

> The function name "postgres_fdw_verify_connection_states" may seem
> unnecessarily long to me. A simpler name like
> "postgres_fdw_verify_connection" may be enough?

Renamed.

> The patch may not be ready for commit due to the review comments,
> and with the feature freeze approaching in a few days,
> it may not be possible to include this feature in v16.

It is sad for me, but it is more important for PostgreSQL to add nicer codes.
I changed status to "Needs review" again.

[1]: https://www.postgresql.org/message-id/CACawEhW19nPfbFpvfke9eidFDxAy%2Bic36wmY0s936T%3DxzxgHog%40mail.gmail.com
[2]: https://www.postgresql.org/message-id/20221017.172642.45253962719866795.horikyota.ntt%40gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear hackers,

> Dear Fujii-san, Tom,
> 
> Thank you for giving a suggestion! PSA new version.

I have reviewed and revised patches by myself.

* Fix handing of poll() based on the Horiguchi-san's point
* Fix WARNING message that shows user name which is used for connection
* PQconnCheck(), PQconnCheckable () are renamed

[1]: https://www.postgresql.org/message-id/flat/20230209.115009.2229702014236187289.horikyota.ntt%40gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
Shubham Khanna
Date:
On Mon, Dec 11, 2023 at 2:08 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Fujii-san, Tom,
>
> Thank you for giving a suggestion! PSA new version.
>
> > Regarding 0001 patch, on second thought, to me, it seems odd to expose
> > a function that doesn't have anything to directly do with PostgreSQL
> > as a libpq function. The function simply calls poll() on the socket
> > with POLLRDHUP if it is supported. While it is certainly convenient to
> > have this function, I'm not sure that it fits within the scope of libpq.
> > Thought?
>
> Current style is motivated by Onder [1] and later discussions. I thought it might
> be useful for other developers, but OK, I can remove changes on libpq modules.
> Horiguchi-san has suggested [2] that it might be overkill to use WaitEventSet()
> mechanism, so I kept using poll().
> I reused the same naming as previous version because they actually do something
> Like libpq, but better naming is very welcome.
>
> > Regarding 0002 patch, the behavior of postgres_fdw_verify_connection_states()
> > in regards to multiple connections using different user mappings seems
> > not well documented. The function seems to return false if either of
> > those connections has been closed.
>
> I did not considered the situation because I have not came up with the situation
> that only one of connections to the same foreign server is broken.
>
> > This behavior means that it's difficult to identify which connection
> > has been closed when there are multiple ones to the given server.
> > To make it easier to identify that, it could be helpful to extend
> > the postgres_fdw_verify_connection_states() function so that it accepts
> > a unique connection as an input instead of just the server name.
> > One suggestion is to extend the function so that it accepts
> > both the server name and the user name used for the connection,
> > and checks the specified connection. If only the server name is specified,
> > the function should check all connections to the server and return false
> > if any of them are closed. This would be helpful since there is typically
> > only one connection to the server in most cases.
>
> Just to confirm, your point "user name" means a local user, right?
> I made a patch for addressing them.
>
> > Additionally, it would be helpful to extend the postgres_fdw_get_connections()
> > function to also return the user name used for each connection,
> > as currently there is no straightforward way to identify it.
>
> Added, See 0003. IIUC there is no good way to extract user mapping from its OID, so I have
> added an function to do that and used it.
>
> > The function name "postgres_fdw_verify_connection_states" may seem
> > unnecessarily long to me. A simpler name like
> > "postgres_fdw_verify_connection" may be enough?
>
> Renamed.
>
> > The patch may not be ready for commit due to the review comments,
> > and with the feature freeze approaching in a few days,
> > it may not be possible to include this feature in v16.
>
> It is sad for me, but it is more important for PostgreSQL to add nicer codes.
> I changed status to "Needs review" again.

I am failing to apply the latest
Patch-"v39-0001-postgres_fdw-add-postgres_fdw_verify_connection-.patch"
 for review. Please find the error I am facing:
D:\Project\Postgres>git am
D:\Project\Patch\v39-0001-postgres_fdw-add-postgres_fdw_verify_connection-.patch
error: patch failed: contrib/postgres_fdw/connection.c:117
error: contrib/postgres_fdw/connection.c: patch does not apply
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Applying: postgres_fdw: add postgres_fdw_verify_connection variants
Patch failed at 0001 postgres_fdw: add postgres_fdw_verify_connection variants
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

Please rebase and post an updated version of the Patch.

Thanks and Regards,
Shubham Khanna.



RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Shubham,

> 
> I am failing to apply the latest
> Patch-"v39-0001-postgres_fdw-add-postgres_fdw_verify_connection-.patch"
>  for review. Please find the error I am facing:
> D:\Project\Postgres>git am
> D:\Project\Patch\v39-0001-postgres_fdw-add-postgres_fdw_verify_connection
> -.patch
> error: patch failed: contrib/postgres_fdw/connection.c:117
> error: contrib/postgres_fdw/connection.c: patch does not apply
> hint: Use 'git am --show-current-patch=diff' to see the failed patch
> Applying: postgres_fdw: add postgres_fdw_verify_connection variants
> Patch failed at 0001 postgres_fdw: add postgres_fdw_verify_connection variants
> When you have resolved this problem, run "git am --continue".
> If you prefer to skip this patch, run "git am --skip" instead.
> To restore the original branch and stop patching, run "git am --abort".

Oh, good catch. Here is a new patch set. There are no new changes from v39.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
Fujii Masao
Date:

On 2023/12/12 11:43, Hayato Kuroda (Fujitsu) wrote:
> Dear Shubham,
> 
>>
>> I am failing to apply the latest
>> Patch-"v39-0001-postgres_fdw-add-postgres_fdw_verify_connection-.patch"
>>   for review. Please find the error I am facing:
>> D:\Project\Postgres>git am
>> D:\Project\Patch\v39-0001-postgres_fdw-add-postgres_fdw_verify_connection
>> -.patch
>> error: patch failed: contrib/postgres_fdw/connection.c:117
>> error: contrib/postgres_fdw/connection.c: patch does not apply
>> hint: Use 'git am --show-current-patch=diff' to see the failed patch
>> Applying: postgres_fdw: add postgres_fdw_verify_connection variants
>> Patch failed at 0001 postgres_fdw: add postgres_fdw_verify_connection variants
>> When you have resolved this problem, run "git am --continue".
>> If you prefer to skip this patch, run "git am --skip" instead.
>> To restore the original branch and stop patching, run "git am --abort".
> 
> Oh, good catch. Here is a new patch set. There are no new changes from v39.

Thanks for the patches!

I've just started reviewing them.

Here are the review comments for 0001 patch:

Do we really need postgres_fdw_verify_connection_all()? The proposed feature
aims to check if all postgres_fdw connections used within the transaction
are still open. If any of those connections are closed, the transaction
can't be committed successfully, so users can roll back immediately upon
detecting a closed connection.

However, postgres_fdw_verify_connection_all() checks all connections made in
the session, not just those used in the current transaction. This means
users can't determine if they should roll back the transaction based on
its return value. Therefore, I'm concerned that
postgres_fdw_verify_connection_all() might not be very useful. Thoughts?

Considering the purpose of this feature, wouldn't it be better to extend
postgres_fdw_get_connections() to include a "used_in_xact" column
(indicating whether the connection has been used in the current transaction)
and a "closed" column (indicating whether the connection has been closed)?
This approach might be more effective than introducing a new function
like the postgres_fdw_verify_connection family?

If it's too much to check if the connection is closed by default whenever
calling postgres_fdw_get_connections(), we could modify it to accept
an argument indicating whether to perform this check. Thoughts?


Here are the review comments for 0003 patch:

The source comment in postgres_fdw_get_connections() should mention
the return value user_name.

We also need to handle the case where the user mapping used by
the connection cache has been dropped. Otherwise, this could
lead to an error.

-----------------------------
=# BEGIN;
=*# SELECT * FROM postgres_fdw_get_connections();
  server_name | user_name | valid
-------------+-----------+-------
  loopback    | public    | t
(1 row)

=*# DROP USER MAPPING FOR public SERVER loopback ;
=*# SELECT * FROM postgres_fdw_get_connections();
ERROR:  cache lookup failed for user mapping 16409
-----------------------------

-SELECT * FROM postgres_fdw_get_connections() ORDER BY 1;
+SELECT server_name, valid FROM postgres_fdw_get_connections() ORDER BY 1;

Shouldn't this test also check if the returned user_name is valid?

+ server_name | user_name | validvalid
+-------------+------------------------
+ loopback1   | postgres  | t
+ loopback2   |           | f

The column name "validvalid" should be "valid".

How can we cause the record with server_name != NULL but user_name = NULL?


Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Fujii-san,

Hi, long time no see :-).
Thanks for reviewing the patch! PSA new version.

> I've just started reviewing them.
> 
> Here are the review comments for 0001 patch:
> 
> Do we really need postgres_fdw_verify_connection_all()? The proposed feature
> aims to check if all postgres_fdw connections used within the transaction
> are still open. If any of those connections are closed, the transaction
> can't be committed successfully, so users can roll back immediately upon
> detecting a closed connection.
> 
> However, postgres_fdw_verify_connection_all() checks all connections made in
> the session, not just those used in the current transaction. This means
> users can't determine if they should roll back the transaction based on
> its return value. Therefore, I'm concerned that
> postgres_fdw_verify_connection_all() might not be very useful. Thoughts?

Right. My primal motivation is to detect the disconnection from remotes and abort
current transaction as soon as possible. In this case, as I posted in [1],
...all()  is not so helpful.
I agreed to remove the function from patch set. Done.

> Considering the purpose of this feature, wouldn't it be better to extend
> postgres_fdw_get_connections() to include a "used_in_xact" column
> (indicating whether the connection has been used in the current transaction)
> and a "closed" column (indicating whether the connection has been closed)?
> This approach might be more effective than introducing a new function
> like the postgres_fdw_verify_connection family?
> 
> If it's too much to check if the connection is closed by default whenever
> calling postgres_fdw_get_connections(), we could modify it to accept
> an argument indicating whether to perform this check. Thoughts?

Sounds interesting. If we can accept to change the definition of pre-existing
function, it seems better. To keep the default behavior, an input parameter
should be added. Attached patch tried to implement.

> Here are the review comments for 0003 patch:
> 
> The source comment in postgres_fdw_get_connections() should mention
> the return value user_name.

Document was updated.

> We also need to handle the case where the user mapping used by
> the connection cache has been dropped. Otherwise, this could
> lead to an error.
> 
> -----------------------------
> =# BEGIN;
> =*# SELECT * FROM postgres_fdw_get_connections();
>   server_name | user_name | valid
> -------------+-----------+-------
>   loopback    | public    | t
> (1 row)
> 
> =*# DROP USER MAPPING FOR public SERVER loopback ;
> =*# SELECT * FROM postgres_fdw_get_connections();
> ERROR:  cache lookup failed for user mapping 16409
> -----------------------------

Fixed.

> -SELECT * FROM postgres_fdw_get_connections() ORDER BY 1;
> +SELECT server_name, valid FROM postgres_fdw_get_connections() ORDER BY
> 1;
> 
> Shouldn't this test also check if the returned user_name is valid?

You meant to say that we should print the user_name, right? Done.

> + server_name | user_name | validvalid
> +-------------+------------------------
> + loopback1   | postgres  | t
> + loopback2   |           | f
> 
> The column name "validvalid" should be "valid".

Right, fixed.

> How can we cause the record with server_name != NULL but user_name = NULL?

Now this can happen when user_name is dropped, but the example was updated.

Below contains a summary of changes.
0001:
- Instead of adding new functions, postgres_fdw_get_connections() was extended.
  Some tricks were added to support old versions. I followed pg_stat_statements.c.
  Attributes were added after the `valid` to preserve ordering of the old version.
- I found an inconsistency of name between source and doc,
  so I unified to postgres_fdw_can_verify_connection().
- Also, test patch (0002) was combined into this.
0002:
- user_name was added after the `valid` to preserve ordering of the old version.
- GetUserMappingFromOid() is allowed to miss a tuple.


[1]:
https://www.postgresql.org/message-id/TYAPR01MB58668728393648C2F7DC7C85F5399%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
Fujii Masao
Date:

On 2024/07/18 19:49, Hayato Kuroda (Fujitsu) wrote:
>> Shouldn't this test also check if the returned user_name is valid?
> 
> You meant to say that we should print the user_name, right? Done.

Yes, I think it's better to test if the value in the user_name column is as expected.


> - I found an inconsistency of name between source and doc,
>    so I unified to postgres_fdw_can_verify_connection().

I'm unsure about the necessity of introducing a standalone function to check
POLLRDHUP availability. Instead of providing postgres_fdw_can_verify_connection(),
could we modify postgres_fdw_get_connections() to return NULL in the "closed"
column on platforms where POLLRDHUP isn't supported?


> - Also, test patch (0002) was combined into this.
> 0002:
> - user_name was added after the `valid` to preserve ordering of the old version.

Do we really need to keep this ordering? Wouldn't it be more intuitive to
have the user_name column next to server_name? In pg_stat_statements,
for example, the ordering isn't always preserved, as seen with
WAL-related columns being added in the middle.


> Thanks for reviewing the patch! PSA new version.

Thanks for updating the patches!


The regression test for postgres_fdw failed with the following diff.

----------------------
  SELECT * FROM postgres_fdw_get_connections() ORDER BY 1;
   server_name | valid | user_name | used_in_xact | closed
  -------------+-------+-----------+--------------+--------
- loopback    | f     | hayato    | t            |
+ loopback    | f     | runner    | t            |
               | f     |           | t            |
  (2 rows)
----------------------


+             * If requested and the connection is not invalidated, check the
+             * status of the remote connection from the backend process and
+             * return the result. Otherwise returns NULL.
+             */
+            if (require_verify && !entry->invalidated && entry->conn)

Should we also consider checking invalidated connections? Even though
a connection is marked as invalidated, it can still be used until
the current transaction ends. Therefore, if an invalidated connection
is used in this transaction (i.e., used_in_xact = true) and has
already been closed (closed = true), users might want to consider
rolling back the transaction promptly. Thought?


+        -- Set client_min_messages to ERROR temporary because the following
+        -- function only throws a WARNING on the supported platform.

Is this still true? From my reading of the code, it doesn't appear
that the function throws a WARNING.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Fujii-san,

> On 2024/07/18 19:49, Hayato Kuroda (Fujitsu) wrote:
> >> Shouldn't this test also check if the returned user_name is valid?
> >
> > You meant to say that we should print the user_name, right? Done.
> 
> Yes, I think it's better to test if the value in the user_name column is as expected.

But this might cause a test failure by cfbot.... See below comment.

> > - I found an inconsistency of name between source and doc,
> >    so I unified to postgres_fdw_can_verify_connection().
> 
> I'm unsure about the necessity of introducing a standalone function to check
> POLLRDHUP availability. Instead of providing
> postgres_fdw_can_verify_connection(),
> could we modify postgres_fdw_get_connections() to return NULL in the "closed"
> column on platforms where POLLRDHUP isn't supported?

Initially I felt that user might not be able to distinguish whether 1) the
connection has not been established yet or 2) the checking cannot be done on this
platform. But after considering more, such servers are not listed in the function.
So modified like that.

> > - Also, test patch (0002) was combined into this.
> > 0002:
> > - user_name was added after the `valid` to preserve ordering of the old version.
> 
> Do we really need to keep this ordering? Wouldn't it be more intuitive to
> have the user_name column next to server_name? In pg_stat_statements,
> for example, the ordering isn't always preserved, as seen with
> WAL-related columns being added in the middle.

I also prefer the style, so changed accordingly.

> > Thanks for reviewing the patch! PSA new version.
> 
> Thanks for updating the patches!
> 
> 
> The regression test for postgres_fdw failed with the following diff.
> 
> ----------------------
>   SELECT * FROM postgres_fdw_get_connections() ORDER BY 1;
>    server_name | valid | user_name | used_in_xact | closed
>   -------------+-------+-----------+--------------+--------
> - loopback    | f     | hayato    | t            |
> + loopback    | f     | runner    | t            |
>                | f     |           | t            |
>   (2 rows)
> ----------------------

This was because the user_name is quite depends on the environment.
I think it looks not so good, but one approach is to print `user_name = CURRENT_USER`
to test the feature. For loopback3, the user_name is set to NULL so that the column
will be NULL as well. How do you think?  Do you have better idea?

> +             * If requested and the connection is not invalidated,
> check the
> +             * status of the remote connection from the backend
> process and
> +             * return the result. Otherwise returns NULL.
> +             */
> +            if (require_verify && !entry->invalidated &&
> entry->conn)
> 
> Should we also consider checking invalidated connections? Even though
> a connection is marked as invalidated, it can still be used until
> the current transaction ends. Therefore, if an invalidated connection
> is used in this transaction (i.e., used_in_xact = true) and has
> already been closed (closed = true), users might want to consider
> rolling back the transaction promptly. Thought?

I confirmed the meaning of `invalidated` attribute. IIUC:

- It is set to true when the server or user mapping is altered, but
- This connection has already been opened within the transaction.

If the entry is invalided, the server_name of postgres_fdw_get_connections is set
set to NULL (because the entry may be modified). Also, the connection is discarded
when the transaction ends.
Based on the unserstanding, yes, the connection should be also checked. One concern
is that user may not recognize which connection is lost (because the column may be blank).

> +        -- Set client_min_messages to ERROR temporary because the
> following
> +        -- function only throws a WARNING on the supported platform.
> 
> Is this still true? From my reading of the code, it doesn't appear
> that the function throws a WARNING.

Good finding, removed.

Attached patches contain above fixes and comment improvements per request from GPT-4o.

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
Fujii Masao
Date:

On 2024/07/24 20:40, Hayato Kuroda (Fujitsu) wrote:
> Attached patches contain above fixes and comment improvements per request from GPT-4o.

Thanks for updating the patches!

I’ve created a new base patch and split the v42-0001 patch into two parts
to implement the feature and improvements step by step. After pushing these
patches, I’ll focus on the v42-0002 patch next.

I’ve attached the three patches.

v43-0001:
This new patch enhances documentation for postgres_fdw_get_connections()
output columns. The output columns were documented in text format,
which is manageable for the current two columns. However, upcoming patches
will add new columns, making text descriptions less readable.
This patch updates the documentation to use a table format,
making it easier for users to understand each output column.

v43-0002:
This patch adds the "used_in_xact" column to postgres_fdw_get_connections().
It separates this change from the original v42-0001 patch for clarity.

v43-0003
This patch adds the "closed" column to postgres_fdw_get_connections().

I’ve also made several code improvements, for example adding a typedef for
pgfdwVersion to simplify its usage, and updated typedefs.list.

+enum pgfdwVersion
+{
+    PGFDW_V1_0 = 0,
+    PGFDW_V1_2,
+}            pgfdwVersion;

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Attachment

RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Fujii-san,

> 
> Thanks for updating the patches!
> 
> I’ve created a new base patch and split the v42-0001 patch into two parts
> to implement the feature and improvements step by step. After pushing these
> patches, I’ll focus on the v42-0002 patch next.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
> 
> I’ve attached the three patches.
> 
> v43-0001:
> This new patch enhances documentation for postgres_fdw_get_connections()
> output columns. The output columns were documented in text format,
> which is manageable for the current two columns. However, upcoming patches
> will add new columns, making text descriptions less readable.
> This patch updates the documentation to use a table format,
> making it easier for users to understand each output column.
> 
> v43-0002:
> This patch adds the "used_in_xact" column to postgres_fdw_get_connections().
> It separates this change from the original v42-0001 patch for clarity.
> 
> v43-0003
> This patch adds the "closed" column to postgres_fdw_get_connections().
> 
> I’ve also made several code improvements, for example adding a typedef for
> pgfdwVersion to simplify its usage, and updated typedefs.list.
> 
> +enum pgfdwVersion
> +{
> +    PGFDW_V1_0 = 0,
> +    PGFDW_V1_2,
> +}            pgfdwVersion;
> 
> Regards,
> 
> --
> Fujii Masao
> Advanced Computing Technology Center
> Research and Development Headquarters
> NTT DATA CORPORATION

RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
I apologize to post incomplete message, here is a correct version.

Dear Fujii-san,

> Thanks for updating the patches!
> 
> I’ve created a new base patch and split the v42-0001 patch into two parts
> to implement the feature and improvements step by step. After pushing these
> patches, I’ll focus on the v42-0002 patch next.

+1.

> I’ve attached the three patches.
> 
> v43-0001:
> This new patch enhances documentation for postgres_fdw_get_connections()
> output columns. The output columns were documented in text format,
> which is manageable for the current two columns. However, upcoming patches
> will add new columns, making text descriptions less readable.
> This patch updates the documentation to use a table format,
> making it easier for users to understand each output column.

Agreed to add the table. I ran a proofreading tool, and it said below points.
You can revise if they are acceptable.

```
+      This function returns information about all open connections that
```
-> "that" can be removed.

```
+         the current transaction but its foreign server or
```
-> can add comma before "but".

> I’ve also made several code improvements, for example adding a typedef for
> pgfdwVersion to simplify its usage, and updated typedefs.list.
> 
> +enum pgfdwVersion
> +{
> +    PGFDW_V1_0 = 0,
> +    PGFDW_V1_2,
> +}            pgfdwVersion;

It was intentionally not added, because while developing pg_createsubscriber,
I got a comment that local-use data have not have to be typedef'd [1]:
```
- src/bin/pg_basebackup/pg_createsubscriber.c
+typedef struct CreateSubscriberOptions
+typedef struct LogicalRepInfo
I think these kinds of local-use struct don't need to be typedef'ed.
(Then you also don't need to update typdefs.list.)
```

A comment for 0002. 

```
+Datum
+postgres_fdw_get_connections_1_2(PG_FUNCTION_ARGS)
+{
+    postgres_fdw_get_connections_internal(fcinfo, PGFDW_V1_2);
+
+    return (Datum) 0;
+}
```

I know they have a same meaning, but can you clarify why (Datun) 0
is returned instead of PG_RETURN_VOID()?

[1]: https://www.postgresql.org/message-id/3ee79f2c-e8b3-4342-857c-a31b87e1afda%40eisentraut.org

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Re: [Proposal] Add foreign-server health checks infrastructure

From
Fujii Masao
Date:

On 2024/07/26 12:15, Hayato Kuroda (Fujitsu) wrote:
> Agreed to add the table. I ran a proofreading tool, and it said below points.
> You can revise if they are acceptable.

Yes, I'm okay with these changes. Thanks for the review!


>> I’ve also made several code improvements, for example adding a typedef for
>> pgfdwVersion to simplify its usage, and updated typedefs.list.
>>
>> +enum pgfdwVersion
>> +{
>> +    PGFDW_V1_0 = 0,
>> +    PGFDW_V1_2,
>> +}            pgfdwVersion;
> 
> It was intentionally not added, because while developing pg_createsubscriber,
> I got a comment that local-use data have not have to be typedef'd [1]:

I didn't know about the rule regarding local-use enums without typedef,
as there are examples like pgssVersion and pgssStoreKind in pg_stat_statements.c.
However, I guess that keeping typedefs.list small is better.
So, I'm fine with removing the typedef from that enum definition
and updating typedefs.list accordingly.


> ```
> +Datum
> +postgres_fdw_get_connections_1_2(PG_FUNCTION_ARGS)
> +{
> +    postgres_fdw_get_connections_internal(fcinfo, PGFDW_V1_2);
> +
> +    return (Datum) 0;
> +}
> ```
> 
> I know they have a same meaning, but can you clarify why (Datun) 0
> is returned instead of PG_RETURN_VOID()?

You're right. I made the change for readability, as similar
functions in pg_stat_statements are implemented that way.
However, it's not essential. I'm okay with reverting it.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Fujii-san,

Just in case - based on the agreement in [1], I updated patches to keep them
consistent. We can use same pictures for further discussions...

IIUC, the patch which adds user_name attribute to get_connection() can be discussed
in later stage, is it right?

[1]: https://www.postgresql.org/message-id/768032ee-fb57-494b-b674-1ccb65b6f969%40oss.nttdata.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
Fujii Masao
Date:

On 2024/07/26 17:07, Hayato Kuroda (Fujitsu) wrote:
> Dear Fujii-san,
> 
> Just in case - based on the agreement in [1], I updated patches to keep them
> consistent. We can use same pictures for further discussions...

Thanks for updating the patches! I pushed them.

> IIUC, the patch which adds user_name attribute to get_connection() can be discussed
> in later stage, is it right?

No, let's work on the patch at this stage :)

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: [Proposal] Add foreign-server health checks infrastructure

From
Fujii Masao
Date:

On 2024/07/26 22:44, Fujii Masao wrote:
> 
> 
> On 2024/07/26 17:07, Hayato Kuroda (Fujitsu) wrote:
>> Dear Fujii-san,
>>
>> Just in case - based on the agreement in [1], I updated patches to keep them
>> consistent. We can use same pictures for further discussions...
> 
> Thanks for updating the patches! I pushed them.

The buildfarm member "hake" reported a failure in the postgres_fdw regression test.

diff -U3 /export/home/elmer/c15x/buildroot/HEAD/pgsql.build/contrib/postgres_fdw/expected/postgres_fdw.out
/export/home/elmer/c15x/buildroot/HEAD/pgsql.build/contrib/postgres_fdw/results/postgres_fdw.out
--- /export/home/elmer/c15x/buildroot/HEAD/pgsql.build/contrib/postgres_fdw/expected/postgres_fdw.out    Fri Jul 26
19:16:292024
 
+++ /export/home/elmer/c15x/buildroot/HEAD/pgsql.build/contrib/postgres_fdw/results/postgres_fdw.out    Fri Jul 26
19:31:122024
 
@@ -12326,7 +12326,7 @@
    FROM postgres_fdw_get_connections(true);
   case
  ------
-    1
+    0
  (1 row)
  
  -- Clean up


The regression.diffs shows that pgfdw_conn_check returned 0 even though pgfdw_conn_checkable()
returned true. This can happen if the "revents" from poll() indicates something other than
POLLRDHUP. I think that "revents" could indicate POLLHUP, POLLERR, or POLLNVAL. Therefore,
IMO pgfdw_conn_check() should be updated as follows. I will test this change.

-               return (input_fd.revents & POLLRDHUP) ? 1 : 0;
+               return (input_fd.revents &
+                               (POLLRDHUP | POLLHUP | POLLERR | POLLNVAL)) ? 1 : 0;

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Fujii-san,

Thanks for pushing and analyzing the failure!

> The regression.diffs shows that pgfdw_conn_check returned 0 even though
> pgfdw_conn_checkable()
> returned true. This can happen if the "revents" from poll() indicates something
> other than
> POLLRDHUP. I think that "revents" could indicate POLLHUP, POLLERR, or
> POLLNVAL. Therefore,
> IMO pgfdw_conn_check() should be updated as follows. I will test this change.
> 
> -               return (input_fd.revents & POLLRDHUP) ? 1 : 0;
> +               return (input_fd.revents &
> +                               (POLLRDHUP | POLLHUP | POLLERR |
> POLLNVAL)) ? 1 : 0;

I think you are right.
According to the man page, the input socket is invalid or disconnected if revents
has such bits. So such cases should be also regarded as *failure*.
After the fix, the animal said OK. Great works!

Best regards,
Hayato Kuroda
FUJITSU LIMITED


RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Fujii-san,

> > IIUC, the patch which adds user_name attribute to get_connection() can be
> discussed
> > in later stage, is it right?
> 
> No, let's work on the patch at this stage :)

OK, here is a rebased patch.

- Changed the name of new API from `GetUserMappingFromOid` to `GetUserMappingByOid`
  to keep the name consistent with others.
- Comments and docs were updated.

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
Fujii Masao
Date:

On 2024/07/29 12:58, Hayato Kuroda (Fujitsu) wrote:
> Dear Fujii-san,
> 
>>> IIUC, the patch which adds user_name attribute to get_connection() can be
>> discussed
>>> in later stage, is it right?
>>
>> No, let's work on the patch at this stage :)
> 
> OK, here is a rebased patch.

Thanks for updating the patch!

> - Changed the name of new API from `GetUserMappingFromOid` to `GetUserMappingByOid`
>    to keep the name consistent with others.

If we expose this function as an FDW helper function, it should return
a complete UserMapping object, including umoptions.

However, if postgres_fdw_get_connections() is the only user of this function,
I'm not inclined to expose it as an FDW helper. Instead, we can either get
the user ID by user mapping OID directly in connection.c using SearchSysCache(),
or add the user ID to ConnCacheEntry and use it in postgres_fdw_get_connections().
Thought?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Fujii-san,

> Thanks for updating the patch!
> 
> > - Changed the name of new API from `GetUserMappingFromOid` to
> `GetUserMappingByOid`
> >    to keep the name consistent with others.
> 
> If we expose this function as an FDW helper function, it should return
> a complete UserMapping object, including umoptions.
> 
> However, if postgres_fdw_get_connections() is the only user of this function,
> I'm not inclined to expose it as an FDW helper.

One reason is that the function does not handle any specific data for postgres_fdw,
however, there are no users and requirests from other projects. Based on that, ok,
we can move it to connection.c. If needed, we can export it again.

> Instead, we can either get
> the user ID by user mapping OID directly in connection.c using SearchSysCache(),
> or add the user ID to ConnCacheEntry and use it in
> postgres_fdw_get_connections().
> Thought?

I moved the function to connection.c, which uses the SearchSysCache1().

I've tried both ways, and they worked well. One difference is that when we use
the extended ConnCacheEntry approach and the entry has been invalidated, we cannot
distinguish the reason. For example, in the below case, the entry is invalidated,
so the user_name of the output record will be NULL, whereas the user mapping is
actually still valid. We may be able to add the reason for invalidation, but
I'm not very motivated to modify the part.

```
BEGIN;
SELECT 1 FROM ft1 LIMIT 1; -- ft1 is at server "loopback"
...
ALTER SERVER loopback OPTIONS (ADD use_remote_estimate 'off');
...
SELECT * FROM postgres_fdw_get_connections() ORDER BY 1;
-> {"loopback", NULL, ....} will be returned
```

Another reason is that we can keep the code consistent with the server case.
The part does not read data from the entry, and we can follow.

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: [Proposal] Add foreign-server health checks infrastructure

From
Fujii Masao
Date:

On 2024/08/02 14:56, Hayato Kuroda (Fujitsu) wrote:
> I moved the function to connection.c, which uses the SearchSysCache1().
> 
> I've tried both ways, and they worked well. One difference is that when we use
> the extended ConnCacheEntry approach and the entry has been invalidated, we cannot
> distinguish the reason. For example, in the below case, the entry is invalidated,
> so the user_name of the output record will be NULL, whereas the user mapping is
> actually still valid. We may be able to add the reason for invalidation, but
> I'm not very motivated to modify the part.

Understood. Also thanks for updating the patch!

     <term><function>postgres_fdw_get_connections(
       IN check_conn boolean DEFAULT false, OUT server_name text,
       OUT valid boolean, OUT used_in_xact boolean, OUT closed boolean)
       returns setof record</function></term>

In the documentation, this part should be updated to include the user_name output column.


+        <entry><structfield>user_name</structfield></entry>
+        <entry><type>text</type></entry>
+        <entry>
+         The local user name of this connection. If the user mapping is
+         dropped but the connection remains open (i.e., marked as
+         invalid), this will be <literal>NULL</literal>.

How about changing the first description to "Name of the local user mapped to the foreign server of this connection, or
"public"if a public mapping is used." for more precision?
 


- server_name | valid | used_in_xact | closed
--------------+-------+--------------+--------
- loopback1   | t     | t            |
- loopback2   | f     | t            |
+ server_name | user_name | valid | used_in_xact | closed
+-------------+-----------+-------+--------------+--------
+ loopback1   | postgres  | t     | t            |
+ loopback2   | postgres  | t     | t            |

How about displaying the record with loopback2 and valid=false like the previous usage example?


+UserMapping *
+GetUserMappingByOid(Oid umid, bool missing_ok)

postgres_fdw doesn't need a generic function to return UserMapping. How about simplifying the function by removing
unnecessarycode, e.g., as follows?
 

----------
tp = SearchSysCache1(USERMAPPINGOID, ObjectIdGetDatum(umid));
if (!HeapTupleIsValid(tp))
     nulls[i++] = true;
else
{
     Oid userid =  ((Form_pg_user_mapping) GETSTRUCT(tp))->userid;
     values[i++] = CStringGetTextDatum(MappingUserName(userid));
     ReleaseSysCache(tp);
}
----------


-ForeignTable *
-GetForeignTable(Oid relid);
-</programlisting>
-
-     This function returns a <structname>ForeignTable</structname> object for
-     the foreign table with the given OID.  A
-     <structname>ForeignTable</structname> object contains properties of the
-     foreign table (see <filename>foreign/foreign.h</filename> for details).
-    </para>
-
-    <para>
-<programlisting>

Why did you remove these code? Just mistake?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Fujii-san,

Thanks for reviewing! PSA new version.

> 
>      <term><function>postgres_fdw_get_connections(
>        IN check_conn boolean DEFAULT false, OUT server_name text,
>        OUT valid boolean, OUT used_in_xact boolean, OUT closed boolean)
>        returns setof record</function></term>
> 
> In the documentation, this part should be updated to include the user_name output
> column.

Right, fixed.

> 
> 
> +        <entry><structfield>user_name</structfield></entry>
> +        <entry><type>text</type></entry>
> +        <entry>
> +         The local user name of this connection. If the user mapping is
> +         dropped but the connection remains open (i.e., marked as
> +         invalid), this will be <literal>NULL</literal>.
> 
> How about changing the first description to "Name of the local user mapped to the
> foreign server of this connection, or "public" if a public mapping is used." for more
> precision?

Added. I ran Grammarly and it said OK.

> - server_name | valid | used_in_xact | closed
> --------------+-------+--------------+--------
> - loopback1   | t     | t            |
> - loopback2   | f     | t            |
> + server_name | user_name | valid | used_in_xact | closed
> +-------------+-----------+-------+--------------+--------
> + loopback1   | postgres  | t     | t            |
> + loopback2   | postgres  | t     | t            |
> 
> How about displaying the record with loopback2 and valid=false like the previous
> usage example?

I did not done that be cause either of server_name or user_name is NULL and
it might be strange. But yes, the example should have more information.
Based on that, I added a tuple so that the example has below. Thought?

loopback1 - user is "postgres", valid
loopback2 - user is "public", valid
loopback3 - user is NULL, invalid

> 
> 
> +UserMapping *
> +GetUserMappingByOid(Oid umid, bool missing_ok)
> 
> postgres_fdw doesn't need a generic function to return UserMapping. How about
> simplifying the function by removing unnecessary code, e.g., as follows?
> 
> ----------
> tp = SearchSysCache1(USERMAPPINGOID, ObjectIdGetDatum(umid));
> if (!HeapTupleIsValid(tp))
>      nulls[i++] = true;
> else
> {
>      Oid userid =  ((Form_pg_user_mapping) GETSTRUCT(tp))->userid;
>      values[i++] = CStringGetTextDatum(MappingUserName(userid));
>      ReleaseSysCache(tp);
> }
> ----------

Largely agreed, but some comments and Assertion() may be needed. Done.

> -ForeignTable *
> -GetForeignTable(Oid relid);
> -</programlisting>
> -
> -     This function returns a <structname>ForeignTable</structname> object
> for
> -     the foreign table with the given OID.  A
> -     <structname>ForeignTable</structname> object contains properties of
> the
> -     foreign table (see <filename>foreign/foreign.h</filename> for details).
> -    </para>
> -
> -    <para>
> -<programlisting>
> 
> Why did you remove these code? Just mistake?

Oh, my fault. I tried to remove GetUserMappingByOid() and the entry was also
Removed at that time. Restored.

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Fujii-san,

Thanks for reviewing!

> I made a couple of small adjustments and attached the updated version.
> If that's ok, I'll go ahead and commit it.
> 
> +         Name of the local user mapped to the foreign server of this
> +         connection, or "public" if a public mapping is used. If the user
> 
> I enclosed "public" with <literal> tag, i.e., <literal>public</literal>.

Right, it should be. I grepped sgml files just in case, but they are tagged by <literal>.

> > I did not done that be cause either of server_name or user_name is NULL and
> > it might be strange. But yes, the example should have more information.
> > Based on that, I added a tuple so that the example has below. Thought?
> >
> > loopback1 - user is "postgres", valid
> > loopback2 - user is "public", valid
> > loopback3 - user is NULL, invalid
> 
> LGTM.
> Also I added the following to the example for clarity:
> 
> postgres=# SELECT * FROM postgres_fdw_get_connections(true);

+1.

Best regards,
Hayato Kuroda
FUJITSU LIMITED


RE: [Proposal] Add foreign-server health checks infrastructure

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear members,

(This mail is just a wrap-up)

I found that the final patch was pushed 2 days ago [1] and BF animals say OK for
now. Therefore, I've closed the CF entry as "committed". We can extend the
feature to other platforms, but I think it could be at another thread later.

Thanks everyone for many efforts!

[1]: https://github.com/postgres/postgres/commit/4f08ab55457751308ffd8d33e82155758cd0e304

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Re: [Proposal] Add foreign-server health checks infrastructure

From
Fujii Masao
Date:

On 2024/09/20 12:00, Hayato Kuroda (Fujitsu) wrote:
> Dear members,
> 
> (This mail is just a wrap-up)
> 
> I found that the final patch was pushed 2 days ago [1] and BF animals say OK for
> now. Therefore, I've closed the CF entry as "committed".

Thanks!

> We can extend the
> feature to other platforms, but I think it could be at another thread later.

Yes.

> Thanks everyone for many efforts!

You, too! Many thanks!

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION