Re: Multiple hosts in connection string failed to failover in non-hot standby mode - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Multiple hosts in connection string failed to failover in non-hot standby mode
Date
Msg-id 229507.1610056206@sss.pgh.pa.us
In response to Re: Multiple hosts in connection string failed to failover in non-hot standby mode  (Hubert Zhang <zhubert@vmware.com>)
Responses Re: Multiple hosts in connection string failed to failover in non-hot standby mode  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hubert Zhang <zhubert@vmware.com> writes:
> [ 0001-Enhance-libpq-to-support-multiple-host-for-non-hot-s.patch ]

I took a quick look at this.  TBH, I'd just drop the first three hunks,
as they've got nothing to do with any failure mode that there's evidence
for in this thread or the prior one, and I'm afraid they're more likely
to create trouble than fix it.

As for the last hunk, why is it after rather than before the SSL/GSS
checks?  I doubt that retrying with/without SSL is going to change
a CANNOT_CONNECT_NOW result, unless maybe by slowing things down to
the point where recovery has finished ;-)

The bigger picture though is

(1) what set of failures should we retry on?  I think CANNOT_CONNECT_NOW
is reasonable, but are there others?

(2) what does this do to the quality of the error messages in cases
where all the connection attempts fail?
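
To make question (1) concrete: the decision boils down to a predicate over
the SQLSTATE the server reports.  A minimal sketch, not code from the patch;
should_try_next_host() is a hypothetical name, and treating 57P03
(the SQLSTATE behind CANNOT_CONNECT_NOW) as the only retryable code is an
assumption that other codes might later be added to:

```c
#include <stdbool.h>
#include <string.h>

/* SQLSTATE that the server reports for ERRCODE_CANNOT_CONNECT_NOW. */
#define SQLSTATE_CANNOT_CONNECT_NOW "57P03"

/*
 * Hypothetical helper: decide whether a server-reported error should make
 * the client move on to the next host in the list.  Only 57P03 is treated
 * as retryable here; any other code (or a missing code) is not.
 */
static bool
should_try_next_host(const char *sqlstate)
{
    if (sqlstate == NULL)
        return false;
    return strcmp(sqlstate, SQLSTATE_CANNOT_CONNECT_NOW) == 0;
}
```

Keeping the test in one place would make it easy to extend the set of
retryable failures if others turn out to be reasonable.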

I think that not much thought was given to error message quality
in the original development of the multi-host feature, so to some
extent I'm asking you to clean up someone else's mess.  Nonetheless,
I feel that we do need to clean it up before we do things that risk
making it even more confusing.

The problems that I see in this area are first that there's no
real standardization in libpq as to whether to append error messages
together or just flush preceding messages; and second that no effort
is made in multi-connection-attempt cases to separate the errors from
different attempts, much less identify which error goes with which
host or IP address.  I think we really need to put some work into
that.  In some cases you can infer what happened from breadcrumbs
we already put into the text, for example

$ psql -h localhost,/tmp -p 12345
psql: error: could not connect to server: Connection refused
        Is the server running on host "localhost" (::1) and accepting
        TCP/IP connections on port 12345?
could not connect to server: Connection refused
        Is the server running on host "localhost" (127.0.0.1) and accepting
        TCP/IP connections on port 12345?
could not connect to server: No such file or directory
        Is the server running locally and accepting
        connections on Unix domain socket "/tmp/.s.PGSQL.12345"?

but this layout doesn't seem particularly helpful to me, and we don't
provide the breadcrumbs at all for a lot of other error cases.

I'm vaguely imagining that we could do something more like

could not connect to host "localhost" (::1), port 12345: Connection refused
could not connect to host "localhost" (127.0.0.1), port 12345: Connection refused
could not connect to socket "/tmp/.s.PGSQL.12345": No such file or directory

Not quite sure if the "Is the server running" hint is worth preserving.
We'd have to reword it quite a bit, and it'd be very duplicative.

The implementation of this might involve sticking the initial string
(up to the colon, in this example) into conn->errorMessage speculatively
as we try each host.  If we then append an error to it and go around
again, we're good.  If we successfully connect, then the contents of
conn->errorMessage don't matter.
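
A rough standalone sketch of that idea follows.  It is not libpq code:
ErrBuf is a hypothetical fixed-size stand-in for conn->errorMessage, and
begin_tcp_attempt(), begin_socket_attempt() and fail_attempt() are made-up
names used only for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for conn->errorMessage: a bounded string buffer. */
typedef struct
{
    char    data[1024];
    size_t  len;
} ErrBuf;

static void
errbuf_init(ErrBuf *b)
{
    b->data[0] = '\0';
    b->len = 0;
}

/* Append s, silently truncating if the buffer is full. */
static void
errbuf_append(ErrBuf *b, const char *s)
{
    size_t  room = sizeof(b->data) - 1 - b->len;
    size_t  n = strlen(s);

    if (n > room)
        n = room;
    memcpy(b->data + b->len, s, n);
    b->len += n;
    b->data[b->len] = '\0';
}

/* Speculatively record which TCP address we are about to try. */
static void
begin_tcp_attempt(ErrBuf *b, const char *host, const char *addr, int port)
{
    char    prefix[256];

    snprintf(prefix, sizeof(prefix),
             "could not connect to host \"%s\" (%s), port %d: ",
             host, addr, port);
    errbuf_append(b, prefix);
}

/* Same, for a Unix-domain socket path. */
static void
begin_socket_attempt(ErrBuf *b, const char *path)
{
    char    prefix[256];

    snprintf(prefix, sizeof(prefix),
             "could not connect to socket \"%s\": ", path);
    errbuf_append(b, prefix);
}

/* The attempt failed: complete the line that the prefix started. */
static void
fail_attempt(ErrBuf *b, const char *reason)
{
    errbuf_append(b, reason);
    errbuf_append(b, "\n");
}
```

Calling begin_tcp_attempt() or begin_socket_attempt() before each attempt
and fail_attempt() after each failure yields one line per attempt, in the
layout proposed above.  On success the caller just ignores the buffer,
dangling prefix and all.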

            regards, tom lane


