Re: Multiple hosts in connection string failed to failover in non-hot standby mode - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: Multiple hosts in connection string failed to failover in non-hot standby mode |
Date | |
Msg-id | 229507.1610056206@sss.pgh.pa.us |
In response to | Re: Multiple hosts in connection string failed to failover in non-hot standby mode (Hubert Zhang <zhubert@vmware.com>) |
Responses | Re: Multiple hosts in connection string failed to failover in non-hot standby mode |
List | pgsql-hackers |
Hubert Zhang <zhubert@vmware.com> writes:
> [ 0001-Enhance-libpq-to-support-multiple-host-for-non-hot-s.patch ]

I took a quick look at this. TBH, I'd just drop the first three hunks, as they've got nothing to do with any failure mode that there's evidence for in this thread or the prior one, and I'm afraid they're more likely to create trouble than fix it.

As for the last hunk, why is it after rather than before the SSL/GSS checks? I doubt that retrying with/without SSL is going to change a CANNOT_CONNECT_NOW result, unless maybe by slowing things down to the point where recovery has finished ;-)

The bigger picture though is

(1) what set of failures should we retry on? I think CANNOT_CONNECT_NOW is reasonable, but are there others?

(2) what does this do to the quality of the error messages in cases where all the connection attempts fail?

I think that error message quality was not thought too much about in the original development of the multi-host feature, so to some extent I'm asking you to clean up someone else's mess. Nonetheless, I feel that we do need to clean it up before we do things that risk making it even more confusing.

The problems that I see in this area are first that there's no real standardization in libpq as to whether to append error messages together or just flush preceding messages; and second that no effort is made in multi-connection-attempt cases to separate the errors from different attempts, much less identify which error goes with which host or IP address. I think we really need to put some work into that.

In some cases you can infer what happened from breadcrumbs we already put into the text, for example

    $ psql -h localhost,/tmp -p 12345
    psql: error: could not connect to server: Connection refused
            Is the server running on host "localhost" (::1) and accepting
            TCP/IP connections on port 12345?
    could not connect to server: Connection refused
            Is the server running on host "localhost" (127.0.0.1) and accepting
            TCP/IP connections on port 12345?
    could not connect to server: No such file or directory
            Is the server running locally and accepting
            connections on Unix domain socket "/tmp/.s.PGSQL.12345"?

but this doesn't seem particularly helpfully laid out to me, and we don't provide the breadcrumbs at all for a lot of other error cases.

I'm vaguely imagining that we could do something more like

    could not connect to host "localhost" (::1), port 12345: Connection refused
    could not connect to host "localhost" (127.0.0.1), port 12345: Connection refused
    could not connect to socket "/tmp/.s.PGSQL.12345": No such file or directory

Not quite sure if the "Is the server running" hint is worth preserving. We'd have to reword it quite a bit, and it'd be very duplicative.

The implementation of this might involve sticking the initial string (up to the colon, in this example) into conn->errorMessage speculatively as we try each host. If we then append an error to it and go around again, we're good. If we successfully connect, then the contents of conn->errorMessage don't matter.

regards, tom lane
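The speculative-prefix idea in the last paragraph can be illustrated with a small standalone sketch. This is not code from the patch or from libpq: `ErrBuf`, `Attempt`, `errbuf_append` and `try_all` are hypothetical names, a fixed-size buffer stands in for conn->errorMessage, and the connection attempts are simulated by a table of canned outcomes. It only shows the buffer discipline: write the "could not connect to ...: " prefix before each attempt, append the failure reason if the attempt fails, and ignore the buffer if an attempt succeeds.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for conn->errorMessage (not libpq's PQExpBuffer). */
typedef struct
{
    char   data[1024];
    size_t len;             /* bytes used, excluding the terminating NUL */
} ErrBuf;

/* Append a string, silently truncating if the buffer is full. */
static void
errbuf_append(ErrBuf *buf, const char *s)
{
    size_t room = sizeof(buf->data) - 1 - buf->len;
    size_t n = strlen(s);

    if (n > room)
        n = room;
    memcpy(buf->data + buf->len, s, n);
    buf->len += n;
    buf->data[buf->len] = '\0';
}

/* One simulated connection attempt (assumption: outcomes are canned). */
typedef struct
{
    const char *target;     /* e.g. host "localhost" (::1), port 12345 */
    const char *failure;    /* failure reason, or NULL if it connects */
} Attempt;

/*
 * Try each target in order.  Returns the index of the first attempt that
 * succeeds, or -1 if all fail.  On -1, err holds one line per attempt.
 */
static int
try_all(const Attempt *attempts, int n, ErrBuf *err)
{
    err->len = 0;
    err->data[0] = '\0';

    for (int i = 0; i < n; i++)
    {
        /* Speculatively record which target we are about to try. */
        errbuf_append(err, "could not connect to ");
        errbuf_append(err, attempts[i].target);
        errbuf_append(err, ": ");

        if (attempts[i].failure == NULL)
            return i;       /* connected: buffer contents no longer matter */

        /* Attempt failed: complete the line and go around again. */
        errbuf_append(err, attempts[i].failure);
        errbuf_append(err, "\n");
    }
    return -1;
}
```

Called with three failing attempts for the targets in the example above, `try_all` would leave one "could not connect to ...: reason" line per target in the buffer, each already tagged with its host or socket. A real implementation would also have to decide what to do with text left in the buffer by earlier code paths, which this sketch sidesteps by resetting it at the start.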