Thread: libpq should not look up all host addresses at once

libpq should not look up all host addresses at once

From
Tom Lane
Date:
Whilst fooling with the patch for CVE-2018-10915, I got annoyed by
the fact that connectDBStart() does the DNS lookups for all supplied
hostnames at once, and fails if any of them are bad.  It was reasonable
to do the lookup there when we only allowed one hostname, but now that
"host" can be a list, this is really pretty stupid.  The whole point
of allowing multiple hostnames is redundancy and avoiding a single
point of failure; but the way this is written, *each* of your servers'
DNS servers is a single point of failure all by itself.  If any one of
them is down, you don't connect.  Plus, in the normal case where you
successfully connect to something before the very last host in the list,
the extra DNS lookups are wasted --- and DNS lookups aren't that cheap,
since they typically involve a network round trip.

So I think what this code should do is (1) look up each hostname as it
needs it, not all at once, and (2) proceed on to the next hostname
if it gets a DNS lookup failure, not fail the whole connection attempt
immediately.  As attached.

I'm tempted to call this a back-patchable bug fix, because the existing
behavior basically negates the entire value of the multi-hostname feature
once you consider the possibility of DNS server failures.  But given the
lack of field complaints, maybe that's an overreaction.

I'll put this in the September fest for review.

            regards, tom lane

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index 986f74f..8200b39 100644
*** a/src/interfaces/libpq/fe-connect.c
--- b/src/interfaces/libpq/fe-connect.c
*************** static PGconn *makeEmptyPGconn(void);
*** 360,366 ****
  static bool fillPGconn(PGconn *conn, PQconninfoOption *connOptions);
  static void freePGconn(PGconn *conn);
  static void closePGconn(PGconn *conn);
! static void release_all_addrinfo(PGconn *conn);
  static void sendTerminateConn(PGconn *conn);
  static PQconninfoOption *conninfo_init(PQExpBuffer errorMessage);
  static PQconninfoOption *parse_connection_string(const char *conninfo,
--- 360,366 ----
  static bool fillPGconn(PGconn *conn, PQconninfoOption *connOptions);
  static void freePGconn(PGconn *conn);
  static void closePGconn(PGconn *conn);
! static void release_conn_addrinfo(PGconn *conn);
  static void sendTerminateConn(PGconn *conn);
  static PQconninfoOption *conninfo_init(PQExpBuffer errorMessage);
  static PQconninfoOption *parse_connection_string(const char *conninfo,
*************** setKeepalivesWin32(PGconn *conn)
*** 1742,1751 ****
  static int
  connectDBStart(PGconn *conn)
  {
-     char        portstr[MAXPGPATH];
-     int            ret;
-     int            i;
-
      if (!conn)
          return 0;

--- 1742,1747 ----
*************** connectDBStart(PGconn *conn)
*** 1765,1865 ****
      resetPQExpBuffer(&conn->errorMessage);

      /*
-      * Look up socket addresses for each possible host using
-      * pg_getaddrinfo_all.
-      */
-     for (i = 0; i < conn->nconnhost; ++i)
-     {
-         pg_conn_host *ch = &conn->connhost[i];
-         struct addrinfo hint;
-         int            thisport;
-
-         /* Initialize hint structure */
-         MemSet(&hint, 0, sizeof(hint));
-         hint.ai_socktype = SOCK_STREAM;
-         hint.ai_family = AF_UNSPEC;
-
-         /* Figure out the port number we're going to use. */
-         if (ch->port == NULL || ch->port[0] == '\0')
-             thisport = DEF_PGPORT;
-         else
-         {
-             thisport = atoi(ch->port);
-             if (thisport < 1 || thisport > 65535)
-             {
-                 appendPQExpBuffer(&conn->errorMessage,
-                                   libpq_gettext("invalid port number: \"%s\"\n"),
-                                   ch->port);
-                 conn->options_valid = false;
-                 goto connect_errReturn;
-             }
-         }
-         snprintf(portstr, sizeof(portstr), "%d", thisport);
-
-         /* Use pg_getaddrinfo_all() to resolve the address */
-         ret = 1;
-         switch (ch->type)
-         {
-             case CHT_HOST_NAME:
-                 ret = pg_getaddrinfo_all(ch->host, portstr, &hint, &ch->addrlist);
-                 if (ret || !ch->addrlist)
-                     appendPQExpBuffer(&conn->errorMessage,
-                                       libpq_gettext("could not translate host name \"%s\" to address: %s\n"),
-                                       ch->host, gai_strerror(ret));
-                 break;
-
-             case CHT_HOST_ADDRESS:
-                 hint.ai_flags = AI_NUMERICHOST;
-                 ret = pg_getaddrinfo_all(ch->hostaddr, portstr, &hint, &ch->addrlist);
-                 if (ret || !ch->addrlist)
-                     appendPQExpBuffer(&conn->errorMessage,
-                                       libpq_gettext("could not parse network address \"%s\": %s\n"),
-                                       ch->hostaddr, gai_strerror(ret));
-                 break;
-
-             case CHT_UNIX_SOCKET:
- #ifdef HAVE_UNIX_SOCKETS
-                 hint.ai_family = AF_UNIX;
-                 UNIXSOCK_PATH(portstr, thisport, ch->host);
-                 if (strlen(portstr) >= UNIXSOCK_PATH_BUFLEN)
-                 {
-                     appendPQExpBuffer(&conn->errorMessage,
-                                       libpq_gettext("Unix-domain socket path \"%s\" is too long (maximum %d
bytes)\n"),
-                                       portstr,
-                                       (int) (UNIXSOCK_PATH_BUFLEN - 1));
-                     conn->options_valid = false;
-                     goto connect_errReturn;
-                 }
-
-                 /*
-                  * NULL hostname tells pg_getaddrinfo_all to parse the service
-                  * name as a Unix-domain socket path.
-                  */
-                 ret = pg_getaddrinfo_all(NULL, portstr, &hint, &ch->addrlist);
-                 if (ret || !ch->addrlist)
-                     appendPQExpBuffer(&conn->errorMessage,
-                                       libpq_gettext("could not translate Unix-domain socket path \"%s\" to address:
%s\n"),
-                                       portstr, gai_strerror(ret));
-                 break;
- #else
-                 Assert(false);
-                 conn->options_valid = false;
-                 goto connect_errReturn;
- #endif
-         }
-         if (ret || !ch->addrlist)
-         {
-             if (ch->addrlist)
-             {
-                 pg_freeaddrinfo_all(hint.ai_family, ch->addrlist);
-                 ch->addrlist = NULL;
-             }
-             conn->options_valid = false;
-             goto connect_errReturn;
-         }
-     }
-
-     /*
       * Set up to try to connect to the first host.  (Setting whichhost = -1 is
       * a bit of a cheat, but PQconnectPoll will advance it to 0 before
       * anything else looks at it.)
--- 1761,1766 ----
*************** keep_going:                        /* We will come back to
*** 2148,2153 ****
--- 2049,2060 ----
      /* Time to advance to next connhost[] entry? */
      if (conn->try_next_host)
      {
+         pg_conn_host *ch;
+         struct addrinfo hint;
+         int            thisport;
+         int            ret;
+         char        portstr[MAXPGPATH];
+
          if (conn->whichhost + 1 >= conn->nconnhost)
          {
              /*
*************** keep_going:                        /* We will come back to
*** 2157,2166 ****
              goto error_return;
          }
          conn->whichhost++;
!         conn->addr_cur = conn->connhost[conn->whichhost].addrlist;
!         /* If no addresses for this host, just try the next one */
!         if (conn->addr_cur == NULL)
!             goto keep_going;
          reset_connection_state_machine = true;
          conn->try_next_host = false;
      }
--- 2064,2163 ----
              goto error_return;
          }
          conn->whichhost++;
!
!         /* Drop any address info for previous host */
!         release_conn_addrinfo(conn);
!
!         /*
!          * Look up info for the new host.  On failure, log the problem in
!          * conn->errorMessage, then loop around to try the next host.  (Note
!          * we don't clear try_next_host until we've succeeded.)
!          */
!         ch = &conn->connhost[conn->whichhost];
!
!         /* Initialize hint structure */
!         MemSet(&hint, 0, sizeof(hint));
!         hint.ai_socktype = SOCK_STREAM;
!         conn->addrlist_family = hint.ai_family = AF_UNSPEC;
!
!         /* Figure out the port number we're going to use. */
!         if (ch->port == NULL || ch->port[0] == '\0')
!             thisport = DEF_PGPORT;
!         else
!         {
!             thisport = atoi(ch->port);
!             if (thisport < 1 || thisport > 65535)
!             {
!                 appendPQExpBuffer(&conn->errorMessage,
!                                   libpq_gettext("invalid port number: \"%s\"\n"),
!                                   ch->port);
!                 goto keep_going;
!             }
!         }
!         snprintf(portstr, sizeof(portstr), "%d", thisport);
!
!         /* Use pg_getaddrinfo_all() to resolve the address */
!         switch (ch->type)
!         {
!             case CHT_HOST_NAME:
!                 ret = pg_getaddrinfo_all(ch->host, portstr, &hint,
!                                          &conn->addrlist);
!                 if (ret || !conn->addrlist)
!                 {
!                     appendPQExpBuffer(&conn->errorMessage,
!                                       libpq_gettext("could not translate host name \"%s\" to address: %s\n"),
!                                       ch->host, gai_strerror(ret));
!                     goto keep_going;
!                 }
!                 break;
!
!             case CHT_HOST_ADDRESS:
!                 hint.ai_flags = AI_NUMERICHOST;
!                 ret = pg_getaddrinfo_all(ch->hostaddr, portstr, &hint,
!                                          &conn->addrlist);
!                 if (ret || !conn->addrlist)
!                 {
!                     appendPQExpBuffer(&conn->errorMessage,
!                                       libpq_gettext("could not parse network address \"%s\": %s\n"),
!                                       ch->hostaddr, gai_strerror(ret));
!                     goto keep_going;
!                 }
!                 break;
!
!             case CHT_UNIX_SOCKET:
! #ifdef HAVE_UNIX_SOCKETS
!                 conn->addrlist_family = hint.ai_family = AF_UNIX;
!                 UNIXSOCK_PATH(portstr, thisport, ch->host);
!                 if (strlen(portstr) >= UNIXSOCK_PATH_BUFLEN)
!                 {
!                     appendPQExpBuffer(&conn->errorMessage,
!                                       libpq_gettext("Unix-domain socket path \"%s\" is too long (maximum %d
bytes)\n"),
!                                       portstr,
!                                       (int) (UNIXSOCK_PATH_BUFLEN - 1));
!                     goto keep_going;
!                 }
!
!                 /*
!                  * NULL hostname tells pg_getaddrinfo_all to parse the service
!                  * name as a Unix-domain socket path.
!                  */
!                 ret = pg_getaddrinfo_all(NULL, portstr, &hint,
!                                          &conn->addrlist);
!                 if (ret || !conn->addrlist)
!                 {
!                     appendPQExpBuffer(&conn->errorMessage,
!                                       libpq_gettext("could not translate Unix-domain socket path \"%s\" to address:
%s\n"),
!                                       portstr, gai_strerror(ret));
!                     goto keep_going;
!                 }
! #else
!                 Assert(false);
! #endif
!                 break;
!         }
!
!         /* OK, scan this addrlist for a working server address */
!         conn->addr_cur = conn->addrlist;
          reset_connection_state_machine = true;
          conn->try_next_host = false;
      }
*************** keep_going:                        /* We will come back to
*** 3125,3132 ****
                      return PGRES_POLLING_READING;
                  }

!                 /* We can release the address lists now. */
!                 release_all_addrinfo(conn);

                  /* We are open for business! */
                  conn->status = CONNECTION_OK;
--- 3122,3129 ----
                      return PGRES_POLLING_READING;
                  }

!                 /* We can release the address list now. */
!                 release_conn_addrinfo(conn);

                  /* We are open for business! */
                  conn->status = CONNECTION_OK;
*************** keep_going:                        /* We will come back to
*** 3188,3195 ****
                  return PGRES_POLLING_READING;
              }

!             /* We can release the address lists now. */
!             release_all_addrinfo(conn);

              /* We are open for business! */
              conn->status = CONNECTION_OK;
--- 3185,3192 ----
                  return PGRES_POLLING_READING;
              }

!             /* We can release the address list now. */
!             release_conn_addrinfo(conn);

              /* We are open for business! */
              conn->status = CONNECTION_OK;
*************** keep_going:                        /* We will come back to
*** 3218,3223 ****
--- 3215,3223 ----
                      goto keep_going;
                  }

+                 /* We can release the address list now. */
+                 release_conn_addrinfo(conn);
+
                  /* We are open for business! */
                  conn->status = CONNECTION_OK;
                  return PGRES_POLLING_OK;
*************** keep_going:                        /* We will come back to
*** 3291,3299 ****
                      PQclear(res);
                      termPQExpBuffer(&savedMessage);

-                     /* We can release the address lists now. */
-                     release_all_addrinfo(conn);
-
                      /*
                       * Finish reading any remaining messages before being
                       * considered as ready.
--- 3291,3296 ----
*************** freePGconn(PGconn *conn)
*** 3630,3661 ****
  }

  /*
!  * release_all_addrinfo
!  *     - free addrinfo of all hostconn elements.
   */
-
  static void
! release_all_addrinfo(PGconn *conn)
  {
!     if (conn->connhost != NULL)
      {
!         int            i;
!
!         for (i = 0; i < conn->nconnhost; ++i)
!         {
!             int            family = AF_UNSPEC;
!
! #ifdef HAVE_UNIX_SOCKETS
!             if (conn->connhost[i].type == CHT_UNIX_SOCKET)
!                 family = AF_UNIX;
! #endif
!
!             pg_freeaddrinfo_all(family,
!                                 conn->connhost[i].addrlist);
!             conn->connhost[i].addrlist = NULL;
!         }
      }
-     conn->addr_cur = NULL;
  }

  /*
--- 3627,3644 ----
  }

  /*
!  * release_conn_addrinfo
!  *     - Free any addrinfo list in the PGconn.
   */
  static void
! release_conn_addrinfo(PGconn *conn)
  {
!     if (conn->addrlist)
      {
!         pg_freeaddrinfo_all(conn->addrlist_family, conn->addrlist);
!         conn->addrlist = NULL;
!         conn->addr_cur = NULL;    /* for safety */
      }
  }

  /*
*************** closePGconn(PGconn *conn)
*** 3714,3720 ****
      conn->xactStatus = PQTRANS_IDLE;
      pqClearAsyncResult(conn);    /* deallocate result */
      resetPQExpBuffer(&conn->errorMessage);
!     release_all_addrinfo(conn);

      /* Reset all state obtained from server, too */
      pqDropServerData(conn);
--- 3697,3703 ----
      conn->xactStatus = PQTRANS_IDLE;
      pqClearAsyncResult(conn);    /* deallocate result */
      resetPQExpBuffer(&conn->errorMessage);
!     release_conn_addrinfo(conn);

      /* Reset all state obtained from server, too */
      pqDropServerData(conn);
diff --git a/src/interfaces/libpq/libpq-int.h b/src/interfaces/libpq/libpq-int.h
index bc60373..fb04c8c 100644
*** a/src/interfaces/libpq/libpq-int.h
--- b/src/interfaces/libpq/libpq-int.h
*************** typedef struct pg_conn_host
*** 312,318 ****
      char       *password;        /* password for this host, read from the
                                   * password file; NULL if not sought or not
                                   * found in password file. */
-     struct addrinfo *addrlist;    /* list of possible backend addresses */
  } pg_conn_host;

  /*
--- 312,317 ----
*************** struct pg_conn
*** 412,418 ****
      /* Transient state needed while establishing connection */
      bool        try_next_addr;    /* time to advance to next address/host? */
      bool        try_next_host;    /* time to advance to next connhost[]? */
!     struct addrinfo *addr_cur;    /* backend address currently being tried */
      PGSetenvStatusType setenv_state;    /* for 2.0 protocol only */
      const PQEnvironmentOption *next_eo;
      bool        send_appname;    /* okay to send application_name? */
--- 411,419 ----
      /* Transient state needed while establishing connection */
      bool        try_next_addr;    /* time to advance to next address/host? */
      bool        try_next_host;    /* time to advance to next connhost[]? */
!     struct addrinfo *addrlist;    /* list of addresses for current connhost */
!     struct addrinfo *addr_cur;    /* the one currently being tried */
!     int            addrlist_family;    /* needed to know how to free addrlist */
      PGSetenvStatusType setenv_state;    /* for 2.0 protocol only */
      const PQEnvironmentOption *next_eo;
      bool        send_appname;    /* okay to send application_name? */

Re: libpq should not look up all host addresses at once

From
Alvaro Herrera
Date:
On 2018-Aug-09, Tom Lane wrote:

> Whilst fooling with the patch for CVE-2018-10915, I got annoyed by
> the fact that connectDBStart() does the DNS lookups for all supplied
> hostnames at once, and fails if any of them are bad.  It was reasonable
> to do the lookup there when we only allowed one hostname, but now that
> "host" can be a list, this is really pretty stupid.  The whole point
> of allowing multiple hostnames is redundancy and avoiding a single
> point of failure; but the way this is written, *each* of your servers'
> DNS servers is a single point of failure all by itself.  If any one of
> them is down, you don't connect.  Plus, in the normal case where you
> successfully connect to something before the very last host in the list,
> the extra DNS lookups are wasted --- and DNS lookups aren't that cheap,
> since they typically involve a network round trip.

I'm not very familiar with the libpq code structure, but I think
connectDBStart() is not used for synchronous connections, only
asynchronous, is that correct?  If that's the case, then perhaps the
reason this hasn't been more widely reported is simple that those two
features (async conns and multihost conninfo strings) are just not very
commonly used ... and even less so with failing DNS setups.

> I'm tempted to call this a back-patchable bug fix, because the existing
> behavior basically negates the entire value of the multi-hostname feature
> once you consider the possibility of DNS server failures.  But given the
> lack of field complaints, maybe that's an overreaction.

Well, since it's broken, then I don't think it serves anybody very well.
I vote to backpatch it to 10.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: libpq should not look up all host addresses at once

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2018-Aug-09, Tom Lane wrote:
>> Whilst fooling with the patch for CVE-2018-10915, I got annoyed by
>> the fact that connectDBStart() does the DNS lookups for all supplied
>> hostnames at once, and fails if any of them are bad.  It was reasonable
>> to do the lookup there when we only allowed one hostname, but now that
>> "host" can be a list, this is really pretty stupid.

> I'm not very familiar with the libpq code structure, but I think
> connectDBStart() is not used for synchronous connections, only
> asynchronous, is that correct?

No, it's used for all connections.  It's trivial to demonstrate the
problem in, eg, psql:

$ psql "host=localhost"
[ connects OK ]
$ psql "host=localhost,bogus"
psql: could not translate host name "bogus" to address: Name or service not known

>> I'm tempted to call this a back-patchable bug fix, because the existing
>> behavior basically negates the entire value of the multi-hostname feature
>> once you consider the possibility of DNS server failures.  But given the
>> lack of field complaints, maybe that's an overreaction.

> Well, since it's broken, then I don't think it serves anybody very well.
> I vote to backpatch it to 10.

The only reason not to is that it involves a change in the contents of
struct PGconn.  But I suppose we could put the added fields at the end
in v10, if we're worried about clients looking into that struct (not
that they should be, but ...)

            regards, tom lane


Re: libpq should not look up all host addresses at once

From
Chapman Flack
Date:
On 08/09/2018 11:05 AM, Tom Lane wrote:

> So I think what this code should do is (1) look up each hostname as it
> needs it, not all at once, and (2) proceed on to the next hostname
> if it gets a DNS lookup failure, not fail the whole connection attempt
> immediately.  As attached.

Would it be worth the complexity to be a little async about it,
fling a few DNS requests out, and try the hosts in the order the
responses come back?

-Chap


Re: libpq should not look up all host addresses at once

From
Tom Lane
Date:
Chapman Flack <chap@anastigmatix.net> writes:
> On 08/09/2018 11:05 AM, Tom Lane wrote:
>> So I think what this code should do is (1) look up each hostname as it
>> needs it, not all at once, and (2) proceed on to the next hostname
>> if it gets a DNS lookup failure, not fail the whole connection attempt
>> immediately.  As attached.

> Would it be worth the complexity to be a little async about it,
> fling a few DNS requests out, and try the hosts in the order the
> responses come back?

It would be nice if an async connection request didn't have to block during
DNS lookups ... but I don't know of any portable library API for async DNS
requests, and it's most certainly not worth the trouble for us to write
our own version of getaddrinfo(3).

In practice, I think the async connection mode is mostly a legacy API
at this point anyway; surely most people who need that sort of behavior
are handling it nowadays by invoking libpq on a separate thread.  So I
just don't see it being worth a huge amount of work and maintenance
effort to get that to happen.  (Having said that, at least moving the
lookup from connectDBStart into PQconnectPoll, as this patch does,
is a step in the right direction.)

Now that I think about it, there may be some text in the libpq docs
claiming that the lookup happens in PQconnectStart not PQconnectPoll;
that would need adjustment.

            regards, tom lane


Re: libpq should not look up all host addresses at once

From
Fabien COELHO
Date:
Hello Tom,

> [...]
>
> So I think what this code should do is (1) look up each hostname as it
> needs it, not all at once, and (2) proceed on to the next hostname
> if it gets a DNS lookup failure, not fail the whole connection attempt
> immediately.  As attached.

A quick test, and very quick glance at the code.

"git apply" gives "error: patch with only garbage at line 3". Patch 
applies with "patch -p1".

Patch compiles, global "make check" ok, although I'm unsure whether the
feature is actually tested somewhere. I think not:-(

As you noted in another message, a small doc update should be needed.

Patch works as expected: I tried with failing dns queries, multiple ips 
some of which or all failing connection attempts...

About the behavior from psql point of view:

* if dns works, error messages are only printed if all attempts failed:

  sh> ./psql "host=127.0.0.17,local2.coelho.net"
  psql: could not connect to server: Connection refused
         Is the server running on host "127.0.0.17" and accepting
         TCP/IP connections on port 5432?
  could not connect to server: Connection refused
         Is the server running on host "local2.coelho.net" (127.0.0.3) and accepting
         TCP/IP connections on port 5432?
  could not connect to server: Connection refused
         Is the server running on host "local2.coelho.net" (127.0.0.2) and accepting
         TCP/IP connections on port 5432?

But nothing shows if one succeeds at some point. I understand that libpq 
is doing its job, but I'm wondering whether this is the best behavior.

Maybe the user would like to know that attempts are made and are failing? 
This would suggest some callback mecanism so that the client is informed 
of in progress issues? Maybe this is for future work.

Maybe psql should show these as warnings?

* when the password is required, there is no way to know for which host/ip 
it is:

  sh> psql "host=127.0.0.17,local2.coelho.net,local.coelho.net"
  Password for user fabien:

* once connected, \conninfo shows only a partial information:

  psql> \conninfo
  You are connected to database "postgres" as user "fabien" on host
  "local3.coelho.net" at port "5432".

But local3 is 127.0.0.4 and 127.0.0.1, which one is it?

* Code

atoi("5432+1") == 5432, so the port syntax check is loose, really.

I'd consider wrapping some of the logic. I'd check the port first, then 
move the host resolution stuff into a function.

I would be fine with backpatching, as the current behavior is not somehow 
broken.

-- 
Fabien.


Re: libpq should not look up all host addresses at once

From
Fabien COELHO
Date:
> * when the password is required, there is no way to know for which host/ip it 
> is:
>
> sh> psql "host=127.0.0.17,local2.coelho.net,local.coelho.net"
> Password for user fabien:

In the same vein on a wrong password:

  sh> psql "host=no-such-host,local2.coelho.net,local3.coelho.net"
  Password for user fabien:
  psql: could not translate host name "no-such-host" to address: Name or service not known
  could not connect to server: Connection refused
         Is the server running on host "local2.coelho.net" (127.0.0.2) and accepting
         TCP/IP connections on port 5432?
  could not connect to server: Connection refused
         Is the server running on host "local2.coelho.net" (127.0.0.3) and accepting
         TCP/IP connections on port 5432?
  FATAL:  password authentication failed for user "fabien"

The Fatal error does not really say for which host/ip the password fail.

Basically, with multiple hostnames and ips per hostname, error messages
need to be more precise.

-- 
Fabien.


Re: libpq should not look up all host addresses at once

From
Noah Misch
Date:
On Thu, Aug 09, 2018 at 02:32:58PM -0400, Tom Lane wrote:
> Chapman Flack <chap@anastigmatix.net> writes:
> > On 08/09/2018 11:05 AM, Tom Lane wrote:
> >> So I think what this code should do is (1) look up each hostname as it
> >> needs it, not all at once, and (2) proceed on to the next hostname
> >> if it gets a DNS lookup failure, not fail the whole connection attempt
> >> immediately.  As attached.
> 
> > Would it be worth the complexity to be a little async about it,
> > fling a few DNS requests out, and try the hosts in the order the
> > responses come back?
> 
> It would be nice if an async connection request didn't have to block during
> DNS lookups ... but I don't know of any portable library API for async DNS
> requests

https://c-ares.haxx.se/ is the async DNS API that I've experienced.  It works
well if you really need such a thing, but I haven't needed it in libpq.


Re: libpq should not look up all host addresses at once

From
Tom Lane
Date:
Fabien COELHO <coelho@cri.ensmp.fr> writes:
> About the behavior from psql point of view:

> * if dns works, error messages are only printed if all attempts failed:
> But nothing shows if one succeeds at some point. I understand that libpq 
> is doing its job, but I'm wondering whether this is the best behavior.

Yeah, this is the behavior that was established by the multi-host patch
to begin with.  This patch just extends that to treat DNS failures the
same way as we already treated other connection problems.

> Maybe the user would like to know that attempts are made and are failing? 
> This would suggest some callback mecanism so that the client is informed 
> of in progress issues? Maybe this is for future work.

Well, the application can already tell that if it wishes to, by noting
whether PQhost/PQport return the values for the first alternative or
later ones.  Not sure that we need anything more.

> * when the password is required, there is no way to know for which host/ip 
> it is:

Again, I'm not here to re-litigate API decisions that were made in
connection with the multi-host patch.  What was decided (and documented)
at that point was that if you don't want to have the same password for all
the hosts in question, you need to use ~/.pgpass to supply per-host
passwords.

In practice, I'm not sure this matters too much.  It's hard to conceive of
a practical use-case in which all the target hosts aren't interchangeable
from the application's/user's standpoint.  That's why there's little or
no provision for varying the other conn parameters per-host.  We could
imagine future extensions to libpq to allow some or all of the rest of
them to be comma-separated lists, but I'm content to wait for a compelling
use-case to be shown before doing that work.

> atoi("5432+1") == 5432, so the port syntax check is loose, really.

libpq has always parsed port parameters that way.  Tightening it now
is not likely to earn us any thanks.

            regards, tom lane


Re: libpq should not look up all host addresses at once

From
Tom Lane
Date:
Fabien COELHO <coelho@cri.ensmp.fr> writes:
> In the same vein on a wrong password:

>   sh> psql "host=no-such-host,local2.coelho.net,local3.coelho.net"
>   Password for user fabien:
>   psql: could not translate host name "no-such-host" to address: Name or service not known
>   could not connect to server: Connection refused
>          Is the server running on host "local2.coelho.net" (127.0.0.2) and accepting
>          TCP/IP connections on port 5432?
>   could not connect to server: Connection refused
>          Is the server running on host "local2.coelho.net" (127.0.0.3) and accepting
>          TCP/IP connections on port 5432?
>   FATAL:  password authentication failed for user "fabien"

> The Fatal error does not really say for which host/ip the password fail.

Yup, but that's not the province of this patch to improve.  See

https://www.postgresql.org/message-id/25918.1533918960@sss.pgh.pa.us

for one that is trying to improve it.

            regards, tom lane


Re: libpq should not look up all host addresses at once

From
Fabien COELHO
Date:
>> The Fatal error does not really say for which host/ip the password fail.
>
> Yup, but that's not the province of this patch to improve.  See
>
> https://www.postgresql.org/message-id/25918.1533918960@sss.pgh.pa.us
>
> for one that is trying to improve it.

Yep, I gathered that afterwards.

-- 
Fabien.


Re: libpq should not look up all host addresses at once

From
Tom Lane
Date:
Fabien COELHO <coelho@cri.ensmp.fr> writes:
> Patch compiles, global "make check" ok, although I'm unsure whether the
> feature is actually tested somewhere. I think not:-(

Yeah, it's hard to test this stuff without either opening up security
hazards or making unwarranted assumptions about the local network setup.
I think that the standard regression tests only use Unix-socket
communication (except on Windows) for exactly that reason, and that makes
it hard to do anything much about regression-testing this feature.

> As you noted in another message, a small doc update should be needed.

Check.  Proposed doc patch attached.  (Only the last hunk is actually
specific to this patch, the rest is cleanup that I noticed while looking
around for possibly-relevant text.)

> I'd consider wrapping some of the logic. I'd check the port first, then 
> move the host resolution stuff into a function.

Don't really see the value of either ...

            regards, tom lane

diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 80e55f5..7c150f3 100644
*** a/doc/src/sgml/libpq.sgml
--- b/doc/src/sgml/libpq.sgml
*************** PostgresPollingStatusType PQconnectPoll(
*** 303,311 ****
         <itemizedlist>
          <listitem>
           <para>
!           The <literal>hostaddr</literal> and <literal>host</literal> parameters are used appropriately to ensure
that
!           name and reverse name queries are not made. See the documentation of
!           these parameters in <xref linkend="libpq-paramkeywords"/> for details.
           </para>
          </listitem>

--- 303,311 ----
         <itemizedlist>
          <listitem>
           <para>
!           The <literal>hostaddr</literal> parameter must be used appropriately
!           to prevent DNS queries from being made.  See the documentation of
!           this parameter in <xref linkend="libpq-paramkeywords"/> for details.
           </para>
          </listitem>

*************** PostgresPollingStatusType PQconnectPoll(
*** 318,324 ****

          <listitem>
           <para>
!           You ensure that the socket is in the appropriate state
            before calling <function>PQconnectPoll</function>, as described below.
           </para>
          </listitem>
--- 318,324 ----

          <listitem>
           <para>
!           You must ensure that the socket is in the appropriate state
            before calling <function>PQconnectPoll</function>, as described below.
           </para>
          </listitem>
*************** PostgresPollingStatusType PQconnectPoll(
*** 326,349 ****
        </para>

        <para>
!        Note: use of <function>PQconnectStartParams</function> is analogous to
!        <function>PQconnectStart</function> shown below.
!       </para>
!
!       <para>
!        To begin a nonblocking connection request, call <literal>conn =
PQconnectStart("<replaceable>connection_info_string</replaceable>")</literal>.
!        If <varname>conn</varname> is null, then <application>libpq</application> has been unable to allocate a new
<structname>PGconn</structname>
!        structure. Otherwise, a valid <structname>PGconn</structname> pointer is returned (though not yet
!        representing a valid connection to the database). On return from
!        <function>PQconnectStart</function>, call <literal>status = PQstatus(conn)</literal>. If
<varname>status</varname>equals 
!        <symbol>CONNECTION_BAD</symbol>, <function>PQconnectStart</function> has failed.
        </para>

        <para>
!        If <function>PQconnectStart</function> succeeds, the next stage is to poll
!        <application>libpq</application> so that it can proceed with the connection sequence.
         Use <function>PQsocket(conn)</function> to obtain the descriptor of the
         socket underlying the database connection.
         Loop thus: If <function>PQconnectPoll(conn)</function> last returned
         <symbol>PGRES_POLLING_READING</symbol>, wait until the socket is ready to
         read (as indicated by <function>select()</function>, <function>poll()</function>, or
--- 326,352 ----
        </para>

        <para>
!        To begin a nonblocking connection request,
!        call <function>PQconnectStart</function>
!        or <function>PQconnectStartParams</function>.  If the result is null,
!        then <application>libpq</application> has been unable to allocate a
!        new <structname>PGconn</structname> structure.  Otherwise, a
!        valid <structname>PGconn</structname> pointer is returned (though not
!        yet representing a valid connection to the database).  Next
!        call <literal>PQstatus(conn)</literal>.  If the result
!        is <symbol>CONNECTION_BAD</symbol>, the connection attempt has already
!        failed, typically because of invalid connection parameters.
        </para>

        <para>
!        If <function>PQconnectStart</function>
!        or <function>PQconnectStartParams</function> succeeds, the next stage
!        is to poll <application>libpq</application> so that it can proceed with
!        the connection sequence.
         Use <function>PQsocket(conn)</function> to obtain the descriptor of the
         socket underlying the database connection.
+        (Caution: do not assume that the socket remains the same
+        across <function>PQconnectPoll</function> calls.)
         Loop thus: If <function>PQconnectPoll(conn)</function> last returned
         <symbol>PGRES_POLLING_READING</symbol>, wait until the socket is ready to
         read (as indicated by <function>select()</function>, <function>poll()</function>, or
*************** PostgresPollingStatusType PQconnectPoll(
*** 352,360 ****
         Conversely, if <function>PQconnectPoll(conn)</function> last returned
         <symbol>PGRES_POLLING_WRITING</symbol>, wait until the socket is ready
         to write, then call <function>PQconnectPoll(conn)</function> again.
!        If you have yet to call
!        <function>PQconnectPoll</function>, i.e., just after the call to
!        <function>PQconnectStart</function>, behave as if it last returned
         <symbol>PGRES_POLLING_WRITING</symbol>.  Continue this loop until
         <function>PQconnectPoll(conn)</function> returns
         <symbol>PGRES_POLLING_FAILED</symbol>, indicating the connection procedure
--- 355,362 ----
         Conversely, if <function>PQconnectPoll(conn)</function> last returned
         <symbol>PGRES_POLLING_WRITING</symbol>, wait until the socket is ready
         to write, then call <function>PQconnectPoll(conn)</function> again.
!        On the first iteration, i.e. if you have yet to call
!        <function>PQconnectPoll</function>, behave as if it last returned
         <symbol>PGRES_POLLING_WRITING</symbol>.  Continue this loop until
         <function>PQconnectPoll(conn)</function> returns
         <symbol>PGRES_POLLING_FAILED</symbol>, indicating the connection procedure
*************** switch(PQstatus(conn))
*** 479,488 ****
        </para>

        <para>
!        Note that if <function>PQconnectStart</function> returns a non-null pointer, you must call
!        <function>PQfinish</function> when you are finished with it, in order to dispose of
!        the structure and any associated memory blocks. This must be done even if
!        the connection attempt fails or is abandoned.
        </para>
       </listitem>
      </varlistentry>
--- 481,492 ----
        </para>

        <para>
!        Note that when <function>PQconnectStart</function>
!        or <function>PQconnectStartParams</function> returns a non-null
!        pointer, you must call <function>PQfinish</function> when you are
!        finished with it, in order to dispose of the structure and any
!        associated memory blocks.  This must be done even if the connection
!        attempt fails or is abandoned.
        </para>
       </listitem>
      </varlistentry>
*************** postgresql://%2Fvar%2Flib%2Fpostgresql/d
*** 913,919 ****
         It is possible to specify multiple hosts to connect to, so that they are
         tried in the given order. In the Keyword/Value format, the <literal>host</literal>,
         <literal>hostaddr</literal>, and <literal>port</literal> options accept a comma-separated
!        list of values. The same number of elements must be given in each option, such
         that e.g. the first <literal>hostaddr</literal> corresponds to the first host name,
         the second <literal>hostaddr</literal> corresponds to the second host name, and so
         forth. As an exception, if only one <literal>port</literal> is specified, it
--- 917,924 ----
         It is possible to specify multiple hosts to connect to, so that they are
         tried in the given order. In the Keyword/Value format, the <literal>host</literal>,
         <literal>hostaddr</literal>, and <literal>port</literal> options accept a comma-separated
!        list of values. The same number of elements must be given in each
!        option that is specified, such
         that e.g. the first <literal>hostaddr</literal> corresponds to the first host name,
         the second <literal>hostaddr</literal> corresponds to the second host name, and so
         forth. As an exception, if only one <literal>port</literal> is specified, it
*************** postgresql://%2Fvar%2Flib%2Fpostgresql/d
*** 922,930 ****

       <para>
         In the connection URI format, you can list multiple <literal>host:port</literal> pairs
!        separated by commas, in the <literal>host</literal> component of the URI. In either
!        format, a single host name can also translate to multiple network addresses. A
!        common example of this is a host that has both an IPv4 and an IPv6 address.
       </para>

       <para>
--- 927,939 ----

       <para>
         In the connection URI format, you can list multiple <literal>host:port</literal> pairs
!        separated by commas, in the <literal>host</literal> component of the URI.
!      </para>
!
!      <para>
!        In either format, a single host name can translate to multiple network
!        addresses. A common example of this is a host that has both an IPv4 and
!        an IPv6 address.
       </para>

       <para>
*************** postgresql://%2Fvar%2Flib%2Fpostgresql/d
*** 958,966 ****
          Name of host to connect to.<indexterm><primary>host name</primary></indexterm>
          If a host name begins with a slash, it specifies Unix-domain
          communication rather than TCP/IP communication; the value is the
!         name of the directory in which the socket file is stored.  If
!         multiple host names are specified, each will be tried in turn in
!         the order given.  The default behavior when <literal>host</literal> is
          not specified, or is empty, is to connect to a Unix-domain
          socket<indexterm><primary>Unix domain socket</primary></indexterm> in
          <filename>/tmp</filename> (or whatever socket directory was specified
--- 967,974 ----
          Name of host to connect to.<indexterm><primary>host name</primary></indexterm>
          If a host name begins with a slash, it specifies Unix-domain
          communication rather than TCP/IP communication; the value is the
!         name of the directory in which the socket file is stored.
!         The default behavior when <literal>host</literal> is
          not specified, or is empty, is to connect to a Unix-domain
          socket<indexterm><primary>Unix domain socket</primary></indexterm> in
          <filename>/tmp</filename> (or whatever socket directory was specified
*************** postgresql://%2Fvar%2Flib%2Fpostgresql/d
*** 997,1004 ****
          <itemizedlist>
           <listitem>
            <para>
!            If <literal>host</literal> is specified without <literal>hostaddr</literal>,
!            a host name lookup occurs.
            </para>
           </listitem>
           <listitem>
--- 1005,1016 ----
          <itemizedlist>
           <listitem>
            <para>
!            If <literal>host</literal> is specified
!            without <literal>hostaddr</literal>, a host name lookup occurs.
!            (When using <function>PQconnectPoll</function>, the lookup occurs
!            when <function>PQconnectPoll</function> first considers this host
!            name, and it may cause <function>PQconnectPoll</function> to block
!            for a significant amount of time.)
            </para>
           </listitem>
           <listitem>

Re: libpq should not look up all host addresses at once

From
Fabien COELHO
Date:
Hello Tom,

>> As you noted in another message, a small doc update should be needed.
>
> Check.  Proposed doc patch attached.  (Only the last hunk is actually
> specific to this patch, the rest is cleanup that I noticed while looking
> around for possibly-relevant text.)

Doc build is ok.

Some comments that you may not find all useful, please accept my apology, 
it just really shows that I read your prose in some detail:-)

The mention of possible reverse dns queries has been removed... but I do 
not think there was any before? There could be if only hostaddr is 
provided but a hostname is required by some auth, but it does not seem to 
be the case according to the documentation.

I read the rational of the host/hostaddr artificial mapping. I cannot say 
I'm thrilled with the result: I do not really see a setting where avoiding 
a DNS query is required but which still needs a hostname for auth... If 
you have GSSAPI or SSPI then you have an underlying network, in which a 
dns query should be fine.

Moreover the feature implementation is misleading as hostaddr changes the 
behavior of host when both are provided. Also, "hosts" accepts ips anyway 
(although is not documented).

STM that we should have had only "host" for specifying the target (name, 
ip, path), and if really necessary an ugly "authname" directive to provide 
names on the side. I know, too late.

I think that in the string format host=foo:5433,bla:5432 could be 
accepted as it is in the URL format.

Some of the changes are not directly related to the multi host feature, 
and should/could be backpatched further?

>> I'd consider wrapping some of the logic. I'd check the port first, then
>> move the host resolution stuff into a function.
>
> Don't really see the value of either ...

What I see is that the host logic is rather lengthy and specific (it 
depends on the connection type -- name, ip, socket including a so nice 
#ifdef), distinct from the port stuff which is dealt with quite quickcly.

By separating the port & host and putting the lengthy stuff in a function 
the scope of the for loop on potential connections would be easier to read 
and understand, and would not require to see CHT_ cases to understand what 
is being done for managing "host" variants, which is a mere detail.

-- 
Fabien.


Re: libpq should not look up all host addresses at once

From
Tom Lane
Date:
Fabien COELHO <coelho@cri.ensmp.fr> writes:
> The mention of possible reverse dns queries has been removed... but I do 
> not think there was any before?

Yeah, that's why I took it out.  If there ever were any reverse lookups in
libpq, we got rid of them AFAICS, so this statement in the docs seemed just
unnecessarily confusing to me.

> I read the rational of the host/hostaddr artificial mapping. I cannot say 
> I'm thrilled with the result: I do not really see a setting where avoiding 
> a DNS query is required but which still needs a hostname for auth... If 
> you have GSSAPI or SSPI then you have an underlying network, in which a 
> dns query should be fine.

The point is that you might wish to avoid a DNS query for speed reasons,
not because it's "required" to do so.  Or, perhaps, you did the DNS
query already using some async DNS library, and you want to pass the
result on to PQconnectStart so it will not block.

> STM that we should have had only "host" for specifying the target (name, 
> ip, path), and if really necessary an ugly "authname" directive to provide 
> names on the side. I know, too late.
> I think that in the string format host=foo:5433,bla:5432 could be 
> accepted as it is in the URL format.

I'm uninterested in these proposals, and they certainly are not something
to include in this particular patch even if I were.

Also, it's too late to redefine the meaning of colon in a host string,
since as you mentioned that could already be an IPv6 address.

>>> I'd consider wrapping some of the logic. I'd check the port first, then
>>> move the host resolution stuff into a function.

>> Don't really see the value of either ...

> What I see is that the host logic is rather lengthy and specific (it 
> depends on the connection type -- name, ip, socket including a so nice 
> #ifdef), distinct from the port stuff which is dealt with quite quickcly.
> By separating the port & host and putting the lengthy stuff in a function 
> the scope of the for loop on potential connections would be easier to read 
> and understand, and would not require to see CHT_ cases to understand what 
> is being done for managing "host" variants, which is a mere detail.

I still don't see the value of it.  This code was inlined into the calling
function before, and it still is with this patch --- just a different
calling function.  Maybe there's an argument for a wholesale refactoring
of PQconnectPoll into smaller pieces, but that's not a task for this patch
either.  (I'm not really convinced it'd be an improvement anyway, at least
not enough of one to justify creating so much back-patching pain.)

            regards, tom lane


Re: libpq should not look up all host addresses at once

From
Garick Hamlin
Date:
On Tue, Aug 14, 2018 at 12:24:32PM +0200, Fabien COELHO wrote:
> 
> Hello Tom,
> 
> >>As you noted in another message, a small doc update should be needed.
> >
> >Check.  Proposed doc patch attached.  (Only the last hunk is actually
> >specific to this patch, the rest is cleanup that I noticed while looking
> >around for possibly-relevant text.)
> 
> Doc build is ok.
> 
> Some comments that you may not find all useful, please accept my apology, it
> just really shows that I read your prose in some detail:-)
> 
> The mention of possible reverse dns queries has been removed... but I do not
> think there was any before? There could be if only hostaddr is provided but
> a hostname is required by some auth, but it does not seem to be the case
> according to the documentation.
> 
> I read the rational of the host/hostaddr artificial mapping. I cannot say
> I'm thrilled with the result: I do not really see a setting where avoiding a
> DNS query is required but which still needs a hostname for auth... If you
> have GSSAPI or SSPI then you have an underlying network, in which a dns
> query should be fine.

FWIW, I think this is useful even it will be uncommon to use.  I run
some HA services here and I find I use this kind of functionality all
the time to test if a standby node functioning properly.  openssh 
GSSAPIServerIdentity does this.  curl does this via '--resolve'.  In
both cases one can check the name authenticates properly via TLS or
GSSAPI while connecting to an IP that is not production.  

The IP might float via VRRP or EIP in AWS, or it might be a service
local OOB network and the frontend might be a load balancer like haproxy.

FWIW, I am not using this for PG today, but this kind of feature is
definitely nice to have for alarming and HA.  It lets proper analysis
happen.  This way not everyone to be called when the local DNS resolver
fails and just the DNS-people can get the 2am call.

Anyway, if it's not a big burden, I suggest you keep it, IIUC.
This kind of thing is really handy especially since today's cloudy-stuff
means one often gets all-the-nat whether one wants it or not.

Garick


Re: libpq should not look up all host addresses at once

From
Nico Williams
Date:
On Tue, Aug 14, 2018 at 03:18:32PM -0400, Garick Hamlin wrote:
> On Tue, Aug 14, 2018 at 12:24:32PM +0200, Fabien COELHO wrote:
> > I read the rational of the host/hostaddr artificial mapping. I cannot say
> > I'm thrilled with the result: I do not really see a setting where avoiding a
> > DNS query is required but which still needs a hostname for auth... If you
> > have GSSAPI or SSPI then you have an underlying network, in which a dns
> > query should be fine.
> 
> FWIW, I think this is useful even it will be uncommon to use.  I run
> some HA services here and I find I use this kind of functionality all
> the time to test if a standby node functioning properly.  openssh 
> GSSAPIServerIdentity does this.  curl does this via '--resolve'.  In
> both cases one can check the name authenticates properly via TLS or
> GSSAPI while connecting to an IP that is not production.  

+1

curl's --resolve is a fantastic diagnostic tool.  I wish it also allowed
changing the destination port as well.

While I'm at it, I strongly prefer using postgresql: URIs to any other
way to specify connect info, and I think PG should do more to encourage
their use -- perhaps even deprecating the alternatives.

Nico
-- 


Re: libpq should not look up all host addresses at once

From
Fabien COELHO
Date:
Hello Garick,

>> I read the rational of the host/hostaddr artificial mapping. I cannot say
>> I'm thrilled with the result: I do not really see a setting where avoiding a
>> DNS query is required but which still needs a hostname for auth... If you
>> have GSSAPI or SSPI then you have an underlying network, in which a dns
>> query should be fine.
>
> FWIW, I think this is useful even it will be uncommon to use.  I run
> some HA services here and I find I use this kind of functionality all
> the time to test if a standby node functioning properly.

Ok, I understand that you want to locally override DNS results for 
testing purposes. Interesting use case.

> Anyway, if it's not a big burden, I suggest you keep it, IIUC.

I would not suggest to remove the feature, esp if there is a reasonable 
use case.

> This kind of thing is really handy especially since today's cloudy-stuff
> means one often gets all-the-nat whether one wants it or not.

Yep, NAT, another possible use case.

So it is a useful feature for specialized use cases! I just lack 
imagination:-) Thanks for the explanations.

Maybe I'll try to just improve the the documentation.

-- 
Fabien.


Re: libpq should not look up all host addresses at once

From
Fabien COELHO
Date:
Hello,

>>>> I'd consider [...]
>>> Don't really see the value of either ...
>> What I see is [...].
> I still don't see the value of it.  [...].

Ok. I suggested to consider, you considered and rejected, fine with me!

-- 
Fabien.