Thread: [HACKERS] [bug fix] PG10: libpq doesn't connect to alternative hosts when some errors occur

Hello,

I found a problem with libpq connection failover.  When libpq cannot connect to earlier hosts in the host list, it doesn't try to connect to other hosts.  For example, when you specify a wrong port that some non-postgres program is using, or some non-postgres program is using PG's port unexpectedly, you get an error like this:

$ psql -h localhost -p 23
psql: received invalid response to SSL negotiation: 
$ psql -h localhost -p 23 -d "sslmode=disable"
psql: expected authentication request from server, but received 

Likewise, when the first host has already reached max_connections, libpq doesn't attempt the connection against later hosts.

The attached patch fixes this.  I'll add this item to the PostgreSQL 10 Open Items.
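
For reference, the failure is easy to reproduce programmatically with a minimal libpq client (a sketch only; the host/port values are placeholders matching the scenario above):

/* reproduce_failover.c -- minimal multi-host test; build with cc ... -lpq */
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    /* PG10 is documented to try each host in turn, left to right. */
    PGconn *conn = PQconnectdb("host=localhost,localhost port=23,5432 "
                               "dbname=postgres connect_timeout=5");

    if (PQstatus(conn) != CONNECTION_OK)
    {
        /* With the behavior reported above, the error on port 23 ends
         * the whole attempt; port 5432 is never tried. */
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }
    printf("connected to %s:%s\n", PQhost(conn), PQport(conn));
    PQfinish(conn);
    return 0;
}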


Regards
Takayuki Tsunakawa



On Fri, May 12, 2017 at 1:28 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> Likewise, when the first host has already reached max_connections, libpq doesn't attempt the connection against later hosts.

It seems to me that the feature is behaving as wanted. Or in short
attempt to connect to the next host only if a connection cannot be
established. If there is a failure once the exchange with the server
has begun, just consider it as a hard failure. This is an important
property for authentication and SSL connection failures actually.
-- 
Michael



From: Michael Paquier [mailto:michael.paquier@gmail.com]
> It seems to me that the feature is behaving as wanted. Or in short attempt
> to connect to the next host only if a connection cannot be established.
> If there is a failure once the exchange with the server has begun, just
> consider it as a hard failure. This is an important property for
> authentication and SSL connection failures actually.

But PgJDBC behaves as expected -- it attempts connections to other hosts (and succeeds).  I believe that's what users would naturally expect.  The current libpq implementation handles only the socket-level connect failure.

Regards
Takayuki Tsunakawa


Michael Paquier <michael.paquier@gmail.com> writes:
> On Fri, May 12, 2017 at 1:28 PM, Tsunakawa, Takayuki
> <tsunakawa.takay@jp.fujitsu.com> wrote:
>> Likewise, when the first host has already reached max_connections, libpq doesn't attempt the connection against later hosts.

> It seems to me that the feature is behaving as wanted. Or in short
> attempt to connect to the next host only if a connection cannot be
> established. If there is a failure once the exchange with the server
> has begun, just consider it as a hard failure. This is an important
> property for authentication and SSL connection failures actually.

I would not really expect that reconnection would retry after arbitrary
failure cases.  Should it retry for "wrong database name", for instance?
It's not hard to imagine that leading to very confusing behavior.
        regards, tom lane



On Fri, May 12, 2017 at 10:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> On Fri, May 12, 2017 at 1:28 PM, Tsunakawa, Takayuki
>> <tsunakawa.takay@jp.fujitsu.com> wrote:
>>> Likewise, when the first host has already reached max_connections, libpq doesn't attempt the connection against later hosts.
>
>> It seems to me that the feature is behaving as wanted. Or in short
>> attempt to connect to the next host only if a connection cannot be
>> established. If there is a failure once the exchange with the server
>> has begun, just consider it as a hard failure. This is an important
>> property for authentication and SSL connection failures actually.
>
> I would not really expect that reconnection would retry after arbitrary
> failure cases.  Should it retry for "wrong database name", for instance?
> It's not hard to imagine that leading to very confusing behavior.

I guess not as well. That would be tricky for the user to have a
different behavior depending on the error returned by the server,
which is why the current code is doing things right IMO. Now, the
feature has been designed similarly to JDBC with its parametrization,
so it could be surprising for users to get a different failure
handling compared to that. Not saying that JDBC is doing it wrong, but
libpq does nothing wrong either.
-- 
Michael



On Sun, May 14, 2017 at 9:19 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, May 12, 2017 at 10:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Michael Paquier <michael.paquier@gmail.com> writes:
>>> On Fri, May 12, 2017 at 1:28 PM, Tsunakawa, Takayuki
>>> <tsunakawa.takay@jp.fujitsu.com> wrote:
>>>> Likewise, when the first host has already reached max_connections, libpq doesn't attempt the connection against later hosts.
>>
>>> It seems to me that the feature is behaving as wanted. Or in short
>>> attempt to connect to the next host only if a connection cannot be
>>> established. If there is a failure once the exchange with the server
>>> has begun, just consider it as a hard failure. This is an important
>>> property for authentication and SSL connection failures actually.
>>
>> I would not really expect that reconnection would retry after arbitrary
>> failure cases.  Should it retry for "wrong database name", for instance?
>> It's not hard to imagine that leading to very confusing behavior.
>
> I guess not as well. That would be tricky for the user to have a
> different behavior depending on the error returned by the server,
> which is why the current code is doing things right IMO. Now, the
> feature has been designed similarly to JDBC with its parametrization,
> so it could be surprising for users to get a different failure
> handling compared to that. Not saying that JDBC is doing it wrong, but
> libpq does nothing wrong either.

I concur.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From: Michael Paquier [mailto:michael.paquier@gmail.com]
> On Fri, May 12, 2017 at 10:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I would not really expect that reconnection would retry after
> > arbitrary failure cases.  Should it retry for "wrong database name", for
> instance?
> > It's not hard to imagine that leading to very confusing behavior.
> 
> I guess not as well. That would be tricky for the user to have a different
> behavior depending on the error returned by the server, which is why the
> current code is doing things right IMO. Now, the feature has been designed
> similarly to JDBC with its parametrization, so it could be surprising for
> users to get a different failure handling compared to that. Not saying that
> JDBC is doing it wrong, but libpq does nothing wrong either.

I didn't intend to make the user have a different behavior depending on the error returned by the server.  I meant attempting connection to alternative hosts when the server returned an error.  I thought the new libpq feature tries to connect to other hosts when a connection attempt fails, where the "connection" is the *database connection* (user's perspective), not the *socket connection* (PG developer's perspective).  I think PgJDBC meets the user's desire better -- "Please connect to some host for better HA if a database server is unavailable for some reason."

By the way, could you elaborate on what problem could occur if my solution is applied?  (It doesn't seem easy for me to imagine...)  FYI, as shown below, the case Tom picked up didn't raise an issue:

[libpq]
$ psql -h localhost,localhost -p 5450,5451 -d aaa
psql: FATAL:  database "aaa" does not exist
$


[JDBC]
$ java org.hsqldb.cmdline.SqlTool postgres
SqlTool v. 3481.
2017-05-15T10:23:55.991+0900  SEVERE  Connection error:
org.postgresql.util.PSQLException: FATAL: database "aaa" does not exist
  Location: File: postinit.c, Routine: InitPostgres, Line: 846
  Server SQLState: 3D000
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2412)
        at org.postgresql.core.v3.QueryExecutorImpl.readStartupMessages(QueryExecutorImpl.java:2538)
        at org.postgresql.core.v3.QueryExecutorImpl.<init>(QueryExecutorImpl.java:122)
        at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:227)
        at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49)
        at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:194)
        at org.postgresql.Driver.makeConnection(Driver.java:431)
        at org.postgresql.Driver.connect(Driver.java:247)
        at java.sql.DriverManager.getConnection(DriverManager.java:664)
        at java.sql.DriverManager.getConnection(DriverManager.java:247)
        at org.hsqldb.lib.RCData.getConnection(Unknown Source)
        at org.hsqldb.cmdline.SqlTool.objectMain(Unknown Source)
        at org.hsqldb.cmdline.SqlTool.main(Unknown Source)

Failed to get a connection to 'jdbc:postgresql://localhost:5450,localhost:5451/aaa' as user "tunakawa".
Cause: FATAL: database "aaa" does not exist
  Location: File: postinit.c, Routine: InitPostgres, Line: 846
  Server SQLState: 3D000

$

Regards
Takayuki Tsunakawa






On Sun, May 14, 2017 at 9:50 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
>> I guess not as well. That would be tricky for the user to have a different
>> behavior depending on the error returned by the server, which is why the
>> current code is doing things right IMO. Now, the feature has been designed
>> similarly to JDBC with its parametrization, so it could be surprising for
>> users to get a different failure handling compared to that. Not saying that
>> JDBC is doing it wrong, but libpq does nothing wrong either.
>
> I didn't intend to make the user have a different behavior depending on the error returned by the server.  I meant attempting connection to alternative hosts when the server returned an error.  I thought the new libpq feature tries to connect to other hosts when a connection attempt fails, where the "connection" is the *database connection* (user's perspective), not the *socket connection* (PG developer's perspective).  I think PgJDBC meets the user's desire better -- "Please connect to some host for better HA if a database server is unavailable for some reason."
>
> By the way, could you elaborate on what problem could occur if my solution is applied?  (It doesn't seem easy for me to imagine...)

Sure.  Imagine that the user thinks that 'foo' and 'bar' are the
relevant database servers for some service and writes 'dbname=quux
host=foo,bar' as a connection string.  However, actually the user has
made a mistake and 'foo' is supporting some other service entirely; it
has no database 'quux'; the database servers which have database
'quux' are in fact 'bar' and 'baz'.  All appears well as long as 'bar'
remains up, because the missing-database error for 'foo' is ignored
and we just connect to 'bar'.  However, when 'bar' goes down then we
are out of service instead of failing over to 'baz' as we should have
done.

Now it's quite possible that the user, if they test carefully, might
realize that things are not working as intended, because the DBA might
say "hey, all of your connections are being directed to 'bar' instead
of being load-balanced properly!".  But even if they are careful
enough to realize this, it may not be clear what has gone wrong.
Under your proposal, the connection to 'foo' could be failing for *any
reason whatsoever* from lack of connectivity to a missing database to
a missing user to a missing CONNECT privilege to an authentication
failure.  If the user looks at the server log and can pick out the
entries from their own connection attempts they can figure it out, but
otherwise they might spend quite a bit of time wondering what's wrong;
after all, libpq will report no error, as long as the connection to
the other server works.

Now, this is all arguable.  You could certainly say -- and you are
saying -- that this feature ought to be defined to retry after any
kind of failure whatsoever.  But I think what Tom and Michael and I
are saying is that this is a failover feature and therefore ought to
try the next server when the first one in the list appears to have
gone down, but not when the first one in the list is unhappy with the
connection request for some other reason.  Who is right is a judgement
call, but I don't think it's self-evident that users want to ignore
anything and everything that might have gone wrong with the connection
to the first server, rather than only those things which resemble a
down server.  It seems quite possible to me that if we had defined it
as you are proposing, somebody would now be arguing for a behavior
change in the other direction.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, May 14, 2017 at 9:50 PM, Tsunakawa, Takayuki
> <tsunakawa.takay@jp.fujitsu.com> wrote:
>> By the way, could you elaborate on what problem could occur if my solution is applied?  (It doesn't seem easy for me to imagine...)

> Sure.  Imagine that the user thinks that 'foo' and 'bar' are the
> relevant database servers for some service and writes 'dbname=quux
> host=foo,bar' as a connection string.  However, actually the user has
> made a mistake and 'foo' is supporting some other service entirely; it
> has no database 'quux'; the database servers which have database
> 'quux' are in fact 'bar' and 'baz'.

Even more simply, suppose that your userid is known to host bar but the
DBA has forgotten to create it on foo.  This is surely a configuration
error that ought to be rectified, not just failed past, or else you don't
have any of the redundancy you think you do.

Of course, the user would have to try connections to both foo and bar
to be sure that they're both configured correctly.  But he might try
"host=foo,bar" and "host=bar,foo" and figure he was OK, not noticing
that both connections had silently been made to bar.

The bigger picture here is that we only want to fail past transient
errors, not configuration errors.  I'm willing to err in favor of
regarding doubtful cases as transient, but most server login rejections
aren't for transient causes.

There might be specific post-connection errors that we should consider
retrying; "too many connections" is an obvious case.
        regards, tom lane



Hello Robert, Tom,

Thank you for being kind enough to explain.  I think I now understand your concern.

From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Robert Haas
> Who is right is a judgement call, but I don't think it's self-evident that
> users want to ignore anything and everything that might have gone wrong
> with the connection to the first server, rather than only those things which
> resemble a down server.  It seems quite possible to me that if we had defined
> it as you are proposing, somebody would now be arguing for a behavior change
> in the other direction.

Judgment call... so, I understood that it's a matter of choosing between helping to detect configuration errors early or service continuity.  Hmm, I'd like to know how other databases treat this, but I couldn't find useful information after some Google searching.  I wonder whether I should ask the PgJDBC people if they know something, because they chose service continuity.


From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> The bigger picture here is that we only want to fail past transient errors,
> not configuration errors.  I'm willing to err in favor of regarding doubtful
> cases as transient, but most server login rejections aren't for transient
> causes.

I got "doubtful cases" as ones such as specifying non-existent host or an unused port number.  In that case, the
configurationerror can't be distinguished from the server failure.
 

What do you think of the following cases?  Don't you want to connect to other servers?

* The DBA shuts down the database.  The server takes a long time to do checkpointing.  During the shutdown checkpoint, libpq tries to connect to the server and receives an error "the database system is shutting down."

* The former primary failed and is now trying to start as a standby, catching up by applying WAL.  During the recovery, libpq tries to connect to the server and receives an error "the database system is performing recovery."

* The database server crashed due to a bug.  Unfortunately, the server takes an unexpectedly long time to shut down because it takes many seconds to write the stats file (as you remember, Tom-san experienced 57 seconds to write the stats file during regression tests.)  During the stats file write, libpq tries to connect to the server and receives an error "the database system is shutting down."

These are equivalent to server failure.  I believe we should prioritize rescuing errors during operation over detecting configuration errors.


> Of course, the user would have to try connections to both foo and bar to
> be sure that they're both configured correctly.  But he might try
> "host=foo,bar" and "host=bar,foo" and figure he was OK, not noticing that
> both connections had silently been made to bar.

In that case, I think he would specify "host=foo" and "host=bar" in turn, because he would be worried about where he's connected if he specified multiple hosts.

Regards
Takayuki Tsunakawa




On Wed, May 17, 2017 at 3:06 AM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> What do you think of the following cases?  Don't you want to connect to other servers?
>
> * The DBA shuts down the database.  The server takes a long time to do checkpointing.  During the shutdown checkpoint, libpq tries to connect to the server and receives an error "the database system is shutting down."
>
> * The former primary failed and is now trying to start as a standby, catching up by applying WAL.  During the recovery, libpq tries to connect to the server and receives an error "the database system is performing recovery."
>
> * The database server crashed due to a bug.  Unfortunately, the server takes an unexpectedly long time to shut down because it takes many seconds to write the stats file (as you remember, Tom-san experienced 57 seconds to write the stats file during regression tests.)  During the stats file write, libpq tries to connect to the server and receives an error "the database system is shutting down."
>
> These are equivalent to server failure.  I believe we should prioritize rescuing errors during operation over detecting configuration errors.

Yeah, you have a point.  I'm willing to admit that we may have defined
the behavior of the feature incorrectly, provided that you're willing
to admit that you're proposing a definition change, not just a bug
fix.

Anybody else want to weigh in with an opinion here?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert Haas <robertmhaas@gmail.com> writes:
> Yeah, you have a point.  I'm willing to admit that we may have defined
> the behavior of the feature incorrectly, provided that you're willing
> to admit that you're proposing a definition change, not just a bug
> fix.

> Anybody else want to weigh in with an opinion here?

I'm not really on board with "try each server until you find one where
this dbname+username+password combination works".  That's just a recipe
for trouble, especially the password angle.

I think it's a good point that there are certain server responses that
we should take as equivalent to "server down", but by the same token
there are responses that we should not take that way.

I suggest that we need to conditionalize the decision based on what
SQLSTATE is reported.  Not sure offhand if it's better to have a whitelist
of SQLSTATEs that allow failing over to the next server, or a blacklist of
SQLSTATEs that don't.
        regards, tom lane
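
For illustration, a whitelist along those lines could look like the sketch below (hypothetical function name, and an illustrative rather than exhaustive code list; nothing like this exists in libpq today):

#include <stdbool.h>
#include <string.h>

/* Sketch of the whitelist idea: fail over only on SQLSTATEs that
 * plausibly mean "server down"; everything else is a hard failure. */
static bool
sqlstate_allows_failover(const char *sqlstate)
{
    static const char *const transient[] = {
        "57P03",                /* cannot_connect_now: starting up/shutting down */
        "53300",                /* too_many_connections */
        NULL
    };

    if (sqlstate == NULL)
        return false;           /* cannot classify: treat as hard failure */

    for (int i = 0; transient[i] != NULL; i++)
        if (strcmp(sqlstate, transient[i]) == 0)
            return true;        /* looks like "server down": try next host */
    return false;               /* e.g. 28P01 bad password: stop here */
}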



Tom, Robert,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > Yeah, you have a point.  I'm willing to admit that we may have defined
> > the behavior of the feature incorrectly, provided that you're willing
> > to admit that you're proposing a definition change, not just a bug
> > fix.
>
> > Anybody else want to weigh in with an opinion here?
>
> I'm not really on board with "try each server until you find one where
> this dbname+username+password combination works".  That's just a recipe
> for trouble, especially the password angle.

Agreed.

> I think it's a good point that there are certain server responses that
> we should take as equivalent to "server down", but by the same token
> there are responses that we should not take that way.

Right.

> I suggest that we need to conditionalize the decision based on what
> SQLSTATE is reported.  Not sure offhand if it's better to have a whitelist
> of SQLSTATEs that allow failing over to the next server, or a blacklist of
> SQLSTATEs that don't.

No particular comment on this.  I do wonder about forward/backwards
compatibility in such lists and if SQLSTATE really covers all
cases/distinctions which are interesting when it comes to making this
decision.

Thanks!

Stephen

Stephen Frost <sfrost@snowman.net> writes:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> I suggest that we need to conditionalize the decision based on what
>> SQLSTATE is reported.  Not sure offhand if it's better to have a whitelist
>> of SQLSTATEs that allow failing over to the next server, or a blacklist of
>> SQLSTATEs that don't.

> No particular comment on this.  I do wonder about forward/backwards
> compatibility in such lists and if SQLSTATE really covers all
> cases/distinctions which are interesting when it comes to making this
> decision.

If the server is reporting the same SQLSTATE for server-down types
of conditions as for server-up, then that's a bug and we need to change
the SQLSTATE assigned to one case or the other.  The entire point of
SQLSTATE is that it should generally capture distinctions as finely
as client software is likely to be interested in.
        regards, tom lane



On Wed, May 17, 2017 at 12:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Yeah, you have a point.  I'm willing to admit that we may have defined
>> the behavior of the feature incorrectly, provided that you're willing
>> to admit that you're proposing a definition change, not just a bug
>> fix.
>
>> Anybody else want to weigh in with an opinion here?
>
> I'm not really on board with "try each server until you find one where
> this dbname+username+password combination works".  That's just a recipe
> for trouble, especially the password angle.

Sure, I know what *your* opinion is.  And I'm somewhat inclined to
agree, but not to the degree that I don't think we should hear what
other people have to say.

> I suggest that we need to conditionalize the decision based on what
> SQLSTATE is reported.  Not sure offhand if it's better to have a whitelist
> of SQLSTATEs that allow failing over to the next server, or a blacklist of
> SQLSTATEs that don't.

Urgh.  There are two things I don't like about that.  First, it's a
major redesign of this feature at the 11th hour.  Second, if we can't
even agree on the general question of whether all, some, or no server
errors should cause a retry, the chances of agreeing on which SQL
states to include in the retry loop are probably pretty low.  Indeed,
there might not be one answer that will be right for everyone.

One good argument for leaving this alone entirely is that this feature
was committed on November 3rd and this thread began on May 12th.  If
there was ample time before feature freeze to question the design and
nobody did, then I'm not sure why we should disregard the freeze to
start whacking it around now, especially on the strength of one
complaint.  It may be that after we get some field experience with
this the right thing to do will become clearer.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert,

* Robert Haas (robertmhaas@gmail.com) wrote:
> One good argument for leaving this alone entirely is that this feature
> was committed on November 3rd and this thread began on May 12th.  If
> there was ample time before feature freeze to question the design and
> nobody did, then I'm not sure why we should disregard the freeze to
> start whacking it around now, especially on the strength of one
> complaint.  It may be that after we get some field experience with
> this the right thing to do will become clearer.

I am not particularly convinced by this argument.  As much as we hope
that committers have worked with a variety of people with varying
interests and that individuals who are concerned about such start
testing just as soon as something is committed, that, frankly, isn't how
the world really works, based on my observations, at least.

The point of this period of time between feature freeze and actual
release is, more-or-less, to figure out if the solution we've reached
actually is a good one, and if not, to do something about it.

Thanks!

Stephen

Stephen Frost <sfrost@snowman.net> writes:
> * Robert Haas (robertmhaas@gmail.com) wrote:
>> One good argument for leaving this alone entirely is that this feature
>> was committed on November 3rd and this thread began on May 12th.  If
>> there was ample time before feature freeze to question the design and
>> nobody did, then I'm not sure why we should disregard the freeze to
>> start whacking it around now, especially on the strength of one
>> complaint.  It may be that after we get some field experience with
>> this the right thing to do will become clearer.

> I am not particularly convinced by this argument.  As much as we hope
> that committers have worked with a variety of people with varying
> interests and that individuals who are concerned about such start
> testing just as soon as something is committed, that, frankly, isn't how
> the world really works, based on my observations, at least.

> The point of this period of time between feature freeze and actual
> release is, more-or-less, to figure out if the solution we've reached
> actually is a good one, and if not, to do something about it.

Sure, but part of the point of beta testing is to get user feedback.

I agree with Robert's point that major redesign of the feature on the
basis of one complaint isn't necessarily the way to go.  Since the
existing behavior is already out in beta1, let's wait and see if anyone
else complains.  We don't need to fix it Right This Instant.

Maybe add this to the list of open issues to reconsider mid-beta?
        regards, tom lane



On Wed, May 17, 2017 at 12:06 AM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote:
From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Robert Haas
> Who is right is a judgement call, but I don't think it's self-evident that
> users want to ignore anything and everything that might have gone wrong
> with the connection to the first server, rather than only those things which
> resemble a down server.  It seems quite possible to me that if we had defined
> it as you are proposing, somebody would now be arguing for a behavior change
> in the other direction.

> Judgment call... so, I understood that it's a matter of choosing between helping to detect configuration errors early or service continuity.

This is how I've been reading this thread, and I'm tending to agree with prioritizing service continuity over configuration error detection.  As a client, if I have an alternative that ends up working, I don't really care whose fault it is that the earlier options weren't.  I don't have enough experience to think up plausible scenarios here, but I'm sold on the theory.

David J.

Tom,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> I agree with Robert's point that major redesign of the feature on the
> basis of one complaint isn't necessarily the way to go.  Since the
> existing behavior is already out in beta1, let's wait and see if anyone
> else complains.  We don't need to fix it Right This Instant.

Fair enough.

> Maybe add this to the list of open issues to reconsider mid-beta?

Works for me.

Thanks!

Stephen

Moin,

On Wed, May 17, 2017 12:34 pm, Robert Haas wrote:
> On Wed, May 17, 2017 at 3:06 AM, Tsunakawa, Takayuki
> <tsunakawa.takay@jp.fujitsu.com> wrote:
>> What do you think of the following cases?  Don't you want to connect to
>> other servers?
>>
>> * The DBA shuts down the database.  The server takes a long time to do
>> checkpointing.  During the shutdown checkpoint, libpq tries to connect
>> to the server and receive an error "the database system is shutting
>> down."
>>
>> * The former primary failed and now is trying to start as a standby,
>> catching up by applying WAL.  During the recovery, libpq tries to
>> connect to the server and receive an error "the database system is
>> performing recovery."
>>
>> * The database server crashed due to a bug.  Unfortunately, the server
>> takes unexpectedly long time to shut down because it takes many seconds
>> to write the stats file (as you remember, Tom-san experienced 57 seconds
>> to write the stats file during regression tests.)  During the stats file
>> write, libpq tries to connect to the server and receive an error "the
>> database system is shutting down."
>>
>> These are equivalent to server failure.  I believe we should prioritize
>> rescuing errors during operation over detecting configuration errors.
>
> Yeah, you have a point.  I'm willing to admit that we may have defined
> the behavior of the feature incorrectly, provided that you're willing
> to admit that you're proposing a definition change, not just a bug
> fix.
>
> Anybody else want to weigh in with an opinion here?

Hm, to me the feature needs to be reliable (for certain values of
reliable) to be useful.

Consider that you have X hosts (redundancy), and a lot of applications
that want a stable connection to the one that (still) works, whichever
this is.

You can then either:

1. make one primary, the other standby(s) and play DNS tricks or similar
to make it appear that there is only one working host, and have all apps
connect to the "one host" (and reconnect to it upon failure)

2. let each app try each host until it finds a working one, if the
connection breaks, retry with the next host

3. or use libpq and let it try the hosts for you.

However, if I understand it correctly, #3 only works reliably in certain
cases (e.g. host down), but not if it is "sort of down". In that case each
app would again need code to retry different hosts until it finds a
working one, instead of letting libpq do the work.

That makes #3 sound hard to deploy in practice, as you might easily just
code up #1 or #2 and call it a day.

All the best,

Tels



From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sure, but part of the point of beta testing is to get user feedback.

Yes, and I'm also proposing this from the user's point of view, which I believe holds true for people here.  I'm worried from my support experience that strict customers would complain and question the HA.


> I agree with Robert's point that major redesign of the feature on the basis
> of one complaint isn't necessarily the way to go.  Since the existing
> behavior is already out in beta1, let's wait and see if anyone else complains.
> We don't need to fix it Right This Instant.

I'm OK with considering this during beta testing.  But do you think there will be enough beta testers, and that some of them will find this kind of subtle problem?  I'm afraid this type of problem will be detected and complained about only after some time in production...  So, I think we should address this proactively on the basis of good sense.

> Maybe add this to the list of open issues to reconsider mid-beta?

Done.  I'll examine whether we can use SQLSTATE.

Regards
Takayuki Tsunakawa




On Thu, May 18, 2017 at 12:07 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
>> Maybe add this to the list of open issues to reconsider mid-beta?
>
> Done.  I'll examine whether we can use SQLSTATE.

Does JDBC use something like that to make a difference between a
failure and a move-on-to-next-one? From a maintenance point of view,
this would require lookups each time a new SQLSTATE is added. Not sure
that people would remember that.
-- 
Michael



From: Michael Paquier [mailto:michael.paquier@gmail.com]
> Does JDBC use something like that to make a difference between a failure
> and a move-on-to-next-one? 

No, it just tries the next host.  See the first while loop in org/postgresql/core/v3/ConnectionFactoryImpl.java.
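
For comparison, the logic of that loop transliterated into C terms (a sketch only; try_connect_host() is a hypothetical helper, and PgJDBC's actual code is the Java loop in openConnectionImpl()):

#include <stddef.h>
#include <libpq-fe.h>

/* Hypothetical helper: returns an established connection or NULL. */
extern PGconn *try_connect_host(const char *host, int port);

PGconn *
connect_any_host(const char *const *hosts, const int *ports, int nhosts)
{
    for (int i = 0; i < nhosts; i++)
    {
        PGconn *conn = try_connect_host(hosts[i], ports[i]);

        if (conn != NULL)
            return conn;        /* first host that works wins */
        /* any failure, socket-level or server-reported, just means
         * "move on and try the next host" */
    }
    return NULL;                /* every host failed */
}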


> From a maintenance point of view, this would
> require lookups each time a new SQLSTATE is added. Not sure that people
> would remember that.

Yes, I have the same concern, but I'll see if there's a good way anyway (e.g. whether we can simply use the class code of the SQLSTATE, which seems hopeless.)  I guess PgJDBC's way is practical and sensible in the end.

Regards
Takayuki Tsunakawa



From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tsunakawa,
> Takayuki
> Done.  I'll examine whether we can use SQLSTATE.

I tried conceivable errors during connection.  Those SQLSTATEs are as follows:

[transient error (after which you may want to try the next host)]
53300 FATAL:  too many connections for role "tuna"
57P03 FATAL:  the database system is starting up

[configuration error (after which you may give up the connection without trying other hosts. Really?)]
55000 FATAL:  database "template0" is not currently accepting connections
3D000 FATAL:  database "aaa" does not exist
28000 FATAL:  no pg_hba.conf entry for host "::1", user "tunakawa", database "postgres", SSL off
28000 FATAL:  role "nouser" does not exist
28P01 FATAL:  password authentication failed for user "tuna"
28P01 DETAIL:  Password does not match for user "tuna".


I looked through the SQLSTATEs, and thought the below ones could possibly be returned during connection:

https://www.postgresql.org/docs/devel/static/errcodes-appendix.html

[transient error]
Class 08 - Connection Exception
Class 40 - Transaction Rollback 
Class 53 - Insufficient Resources 
Class 54 - Program Limit Exceeded 
Class 55 - Object Not In Prerequisite State 
Class 57 - Operator Intervention 
Class 58 - System Error (errors external to PostgreSQL itself) 
Class XX - Internal Error 

[configuration error]
Class 28 - Invalid Authorization Specification 
Class 3D - Invalid Catalog Name 
Class 42 - Syntax Error or Access Rule Violation 

So, how about trying connection to the next host when the class code is neither 28, 3D, nor 42?
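
A minimal sketch of that class-code test (hypothetical helper; the class list is exactly the one proposed above, and whether a missing SQLSTATE should count as transient is an open question):

#include <stdbool.h>
#include <string.h>

/* Give up (no failover) for classes 28, 3D and 42; otherwise try the
 * next host.  Only the first two characters of the SQLSTATE matter. */
static bool
should_try_next_host(const char *sqlstate)
{
    static const char *const hard_classes[] = {"28", "3D", "42", NULL};

    if (sqlstate == NULL || strlen(sqlstate) < 2)
        return true;            /* no SQLSTATE (e.g. non-postgres peer) */

    for (int i = 0; hard_classes[i] != NULL; i++)
        if (strncmp(sqlstate, hard_classes[i], 2) == 0)
            return false;       /* configuration error: hard failure */
    return true;                /* anything else: fail over */
}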

Honestly, I'm not happy with this approach, for a maintenance reason that others are worried about.  Besides, when the connection target is not postgres and returns invalid data, no SQLSTATE is available.  I'm sorry to repeat myself, but I believe PgJDBC's approach is practically good.  If you think the SQLSTATE is the only way to go, I will put up with it.  It would be disappointing if nothing is done.

Regards
Takayuki Tsunakawa




On Thu, May 18, 2017 at 5:05 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> So, how about trying connection to the next host when the class code is neither 28, 3D, nor 42?
>
> Honestly, I'm not happy with this approach, for a maintenance reason that others are worried about.  Besides, when the connection target is not postgres and returns invalid data, no SQLSTATE is available.  I'm sorry to repeat myself, but I believe PgJDBC's approach is practically good.  If you think the SQLSTATE is the only way to go, I will put up with it.  It would be disappointing if nothing is done.

FWIW, I am of the opinion to not have an implementation based on any
SQLSTATE codes, as well as not doing something similar to JDBC.
Keeping things simple is one reason, a second is that the approach
taken by libpq is correct at its root.
--
Michael



On Thu, May 18, 2017 at 7:06 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> FWIW, I am of the opinion to not have an implementation based on any
> SQLSTATE codes, as well as not doing something similar to JDBC.
> Keeping things simple is one reason, a second is that the approach
> taken by libpq is correct at its root.

Because why?

I was initially on the same side as you and Tom, but now I'm really
wavering.  What good is a feature that's supposed to find you a usable
connection if it sometimes decides not to find one?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Thu, May 18, 2017 at 11:30 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, May 18, 2017 at 7:06 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> FWIW, I am of the opinion to not have an implementation based on any
>> SQLSTATE codes, as well as not doing something similar to JDBC.
>> Keeping things simple is one reason, a second is that the approach
>> taken by libpq is correct at its root.
>
> Because why?

Because it is critical to let the user know as well *why* an error
happened. Imagine that this feature is used with multiple nodes, all
primaries. If a DB admin busted the credentials in one of them then
all the load would be redirected on the other nodes, without knowing
what is actually causing the error. Then the node where the
credentials have been changed would just run idle, and the application
would be unaware of that.
-- 
Michael



On 5/17/17 13:19, Tom Lane wrote:
> I agree with Robert's point that major redesign of the feature on the
> basis of one complaint isn't necessarily the way to go.  Since the
> existing behavior is already out in beta1, let's wait and see if anyone
> else complains.  We don't need to fix it Right This Instant.
> 
> Maybe add this to the list of open issues to reconsider mid-beta?

The problem is that if we decide to change the behavior mid-beta, then
we'll only have the rest of beta to find out whether people will like
the other behavior.

I would aim for the behavior that is most suitable for refinement in the
future.  The current behavior seems to match that.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Peter Eisentraut
> The problem is that if we decide to change the behavior mid-beta, then we'll
> only have the rest of beta to find out whether people will like the other
> behavior.
> 
> I would aim for the behavior that is most suitable for refinement in the
> future.  The current behavior seems to match that.

I think the pre-final release period is the very time for refinement, from the perspective of users and of PG developers as users.  One thing I'm worried about is that people here might become more conservative against change once the final version is released.

Regards
Takayuki Tsunakawa





On Fri, May 19, 2017 at 11:01 AM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> From: pgsql-hackers-owner@postgresql.org
>> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Peter Eisentraut
>> The problem is that if we decide to change the behavior mid-beta, then we'll
>> only have the rest of beta to find out whether people will like the other
>> behavior.
>>
>> I would aim for the behavior that is most suitable for refinement in the
>> future.  The current behavior seems to match that.
>
> I think the pre-final release period is the very time for refinement, from the perspective of users and of PG developers as users.

Sure that is the correct period to argue.

>  One thing I'm worried about is that people here might become more conservative against change once the final version is released.

Any redesign after release would finish by being a new feature, which
would be in this case a new connection parameter or an extra option
that works with the current parameter, say something to allow soft or
hard failures when multiple hosts are defined.
-- 
Michael



From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Michael Paquier
> On Thu, May 18, 2017 at 11:30 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > Because why?
> 
> Because it is critical to let the user know as well *why* an error happened.
> Imagine that this feature is used with multiple nodes, all primaries. If
> a DB admin busted the credentials in one of them then all the load would
> be redirected on the other nodes, without knowing what is actually causing
> the error. Then the node where the credentials have been changed would just
> run idle, and the application would be unaware of that.

In that case, the DBA can see the authentication errors in the server log of the idle instance.

I'm sorry to repeat myself, but libpq connection failover is the feature for HA.  So I believe what to prioritize is
HA.

And the documentation looks somewhat misleading.  I get the impression that libpq tries hosts until success regardless of the failure reason: it aims for a successful connection, not giving up early.  Who would read this as "libpq gives up the connection under some circumstances"?


https://www.postgresql.org/docs/devel/static/libpq-connect.html
--------------------------------------------------
It is possible to specify multiple host components, each with an optional port component, in a single URI. A URI of the form postgresql://host1:port1,host2:port2,host3:port3/ is equivalent to a connection string of the form host=host1,host2,host3 port=port1,port2,port3. Each host will be tried in turn until a connection is successfully established.

...
If multiple host names are specified, each will be tried in turn in the order given.
--------------------------------------------------

Regards
Takayuki Tsunakawa


From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Michael Paquier
> > One thing I'm worried about is that people here might become more conservative against change once the final version is released.
> 
> Any redesign after release would finish by being a new feature, which would
> be in this case a new connection parameter or an extra option that works
> with the current parameter, say something to allow soft or hard failures
> when multiple hosts are defined.

Hmm... but I can't imagine the parameter would be very meaningful for users.

Regards
Takayuki Tsunakawa


On Thu, May 18, 2017 10:24 pm, Tsunakawa, Takayuki wrote:
> From: pgsql-hackers-owner@postgresql.org
>> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Michael
>> Paquier
>> On Thu, May 18, 2017 at 11:30 PM, Robert Haas <robertmhaas@gmail.com>
>> wrote:
>> > Because why?
>>
>> Because it is critical to let the user know as well *why* an error
>> happened.
>> Imagine that this feature is used with multiple nodes, all primaries.
>> If
>> a DB admin busted the credentials in one of them then all the load
>> would
>> be redirected on the other nodes, without knowing what is actually
>> causing
>> the error. Then the node where the credentials have been changed would
>> just
>> run idle, and the application would be unaware of that.
>
> In that case, the DBA can see the authentication errors in the server log
> of the idle instance.
>
> I'm sorry to repeat myself, but libpq connection failover is the feature
> for HA.  So I believe what to prioritize is HA.

I'm in agreement here; the feature sounds very useful for HA to me, but
HA means it needs to work as reliably as possible, not just "often
enough" :)

If one goes to the length of having multiple instances, there is surely
also monitoring in place to see if they are healthy or under load/stress.

The beauty of having libpq connect to multiple hosts until one works is
that you can handle temporary unavailability (e.g. one instance is
restarted for patching) and general failure (one instance goes down due
to whatever error) in one place, without having to implement this logic
in every app (database user connector).


Imagine, f.i., that you have 3 hosts A, B and C.

There are two scenarios here: everyone tries "A,B,C", or everyone tries
random permutations like "A,C,B" or "B,C,A".

In the first scenario, almost all connections would go to A until it no
longer accepts connections; then they spill over to B.

In the second one, each host gets 1/3 of all connections equally.

Now imagine that B is down, for either a brief period or permanently.

If libpq stops when it cannot connect to B, then the scenarios play out like this:

1: Almost all connections run on A, but a random subset breaks when
spillover to B does not happen. Host C is idle.

2: 2/3 of all connections just work, 1/3 just breaks. Both A and C have a
higher load than usual.

If libpq skips B and continues, then we have instead:

1: Almost all connections run on A, but a random subset spills over to C
after skipping B.

2: All connections run on A or C, B is always skipped if it appears before
A or C.

The admin would see on the monitoring that B is offline (briefly or
permanently) and needs to correct it.

From the user's perspective, the second variant is smooth, while the first
breaks randomly.  A "database user" would not really want to know that B
is down or why; it would just expect to get a working DB connection.

That's my 0.02 € anyway.

Tels




On Thu, May 18, 2017 at 8:11 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, May 18, 2017 at 11:30 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, May 18, 2017 at 7:06 AM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> FWIW, I am of the opinion to not have an implementation based on any
>>> SQLSTATE codes, as well as not doing something similar to JDBC.
>>> Keeping things simple is one reason, a second is that the approach
>>> taken by libpq is correct at its root.
>>
>> Because why?
>
> Because it is critical to let the user know as well *why* an error
> happened. Imagine that this feature is used with multiple nodes, all
> primaries. If a DB admin busted the credentials in one of them then
> all the load would be redirected on the other nodes, without knowing
> what is actually causing the error. Then the node where the
> credentials have been changed would just run idle, and the application
> would be unaware of that.

The entire purpose of an application-level failover feature is to make
the application unaware of failures.  That's like complaining that the
stove gets hot when you turn it on.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Fri, May 19, 2017 at 11:08:41AM +0900, Michael Paquier wrote:
> On Fri, May 19, 2017 at 11:01 AM, Tsunakawa, Takayuki
> <tsunakawa.takay@jp.fujitsu.com> wrote:
> > From: pgsql-hackers-owner@postgresql.org
> >> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Peter Eisentraut
> >> The problem is that if we decide to change the behavior mid-beta, then we'll
> >> only have the rest of beta to find out whether people will like the other
> >> behavior.
> >>
> >> I would aim for the behavior that is most suitable for refinement in the
> >> future.  The current behavior seems to match that.
> >
> > I think the pre-final release period is the very time for refinement, from the perspective of users and of PG developers as users.
> 
> Sure that is the correct period to argue.

We've reached that period.  If anyone is going to push for a change here, now
is the time.  Absent such arguments, the behavior won't change.



On Fri, Jul 28, 2017 at 1:30 AM, Noah Misch <noah@leadboat.com> wrote:
> On Fri, May 19, 2017 at 11:08:41AM +0900, Michael Paquier wrote:
>> On Fri, May 19, 2017 at 11:01 AM, Tsunakawa, Takayuki
>> <tsunakawa.takay@jp.fujitsu.com> wrote:
>> > From: pgsql-hackers-owner@postgresql.org
>> >> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Peter Eisentraut
>> >> The problem is that if we decide to change the behavior mid-beta, then we'll
>> >> only have the rest of beta to find out whether people will like the other
>> >> behavior.
>> >>
>> >> I would aim for the behavior that is most suitable for refinement in the
>> >> future.  The current behavior seems to match that.
>> >
>> > I think the pre-final release period is the very time for refinement, from the perspective of users and of PG developers as users.
>>
>> Sure that is the correct period to argue.
>
> We've reached that period.  If anyone is going to push for a change here, now
> is the time.  Absent such arguments, the behavior won't change.

Well, I started out believing that the current behavior was for the
best, and then completely reversed my position and favored the OP's
proposal.  Nothing has really happened since then to change my mind,
so I guess I'm still in that camp.  But do we have any new data
points?  Have any beta-testers tested this and what do they think?
The only non-developer (i.e. person not living in an ivory tower) who
has weighed in here is Tels, who favored reversing the original
decision and adopting Tsunakawa-san's position, and that was 2 months
ago.

I am pretty reluctant to tinker with this at this late date and in the
face of several opposing votes, but I do think that we bet on the
wrong horse.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



From: Robert Haas [mailto:robertmhaas@gmail.com]
> Well, I started out believing that the current behavior was for the best,
> and then completely reversed my position and favored the OP's proposal.
> Nothing has really happened since then to change my mind, so I guess I'm
> still in that camp.  But do we have any new data points?  Have any
> beta-testers tested this and what do they think?
> The only non-developer (i.e. person not living in an ivory tower) who has
> weighed in here is Tels, who favored reversing the original decision and
> adopting Tsunakawa-san's position, and that was 2 months ago.
> 
> I am pretty reluctant to tinker with this at this late date and in the face
> of several opposing votes, but I do think that we bet on the wrong horse.

Thank you, Robert and Tels.  Yes, Tels's comment sounds plausible, as from a representative real user who expects high availability.  I'm sorry to repeat myself, but this feature is for HA, so libpq should attempt to connect to the next host when it fails to establish a connection.

Who can bring this to a conclusion?  I don't think the absence of feedback from beta users means satisfaction with the current behavior.

Regards
Takayuki Tsunakawa





Hello Robert, Noah,

From: Robert Haas [mailto:robertmhaas@gmail.com]
> On Fri, Jul 28, 2017 at 1:30 AM, Noah Misch <noah@leadboat.com> wrote:
> > We've reached that period.  If anyone is going to push for a change
> > here, now is the time.  Absent such arguments, the behavior won't change.
> 
> Well, I started out believing that the current behavior was for the best,
> and then completely reversed my position and favored the OP's proposal.
> Nothing has really happened since then to change my mind, so I guess I'm
> still in that camp.  But do we have any new data points?  Have any
> beta-testers tested this and what do they think?
> The only non-developer (i.e. person not living in an ivory tower) who has
> weighed in here is Tels, who favored reversing the original decision and
> adopting Tsunakawa-san's position, and that was 2 months ago.
> 
> I am pretty reluctant to tinker with this at this late date and in the face
> of several opposing votes, but I do think that we bet on the wrong horse.

Sorry again, but how can we handle this?  A non-PG-developer, Tels (and possibly someone else, IIRC), asked that the behavior be changed during the beta period.  Why should we do nothing?

Regards
Takayuki Tsunakawa






On Thu, Sep 14, 2017 at 3:23 AM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> Sorry again, but how can we handle this?  A non-PG-developer, Tels (and possibly someone else, IIRC), asked that the behavior be changed during the beta period.  Why should we do nothing?

Because we do not have consensus on changing it.  I've decided that
you're right, but several other people are saying "no".  I can't just
go change it in the face of objections.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Thursday, September 14, 2017, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Sep 14, 2017 at 3:23 AM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> Sorry again, but how can we handle this?  A non-PG-developer, Tels (and possibly someone else, IIRC), asked that the behavior be changed during the beta period.  Why should we do nothing?

Because we do not have consensus on changing it.  I've decided that
you're right, but several other people are saying "no".  I can't just
go change it in the face of objections.


Add my +1 for changing this to the official tally.

David J. 
Hello Tom, Michael, Robert, Noah,

From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Robert Haas
> On Thu, Sep 14, 2017 at 3:23 AM, Tsunakawa, Takayuki
> <tsunakawa.takay@jp.fujitsu.com> wrote:
> > Sorry again, but how can we handle this?  A non-PG-developer, Tels (and possibly someone else, IIRC), asked that the behavior be changed during the beta period.  Why should we do nothing?
> 
> Because we do not have consensus on changing it.  I've decided that you're
> right, but several other people are saying "no".  I can't just go change
> it in the face of objections.

How are things decided in cases like this?  Does the RMT usually do some kind of poll?

So far, there are four proponents (Tels (non-PG-developer), David J., Robert and me), and two opponents (Tom and
Michael).

Regards
Takayuki Tsunakawa



On Fri, Sep 15, 2017 at 3:54 AM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> So far, there are four proponents (Tels (non-PG-developer), David J., Robert and me), and two opponents (Tom and Michael).

4-2 is a reasonable vote in favor of proceeding, although it's a bit
marginal given the stature of the opponents.

I'm still not going to change this just before rc1 wraps though.  I
think it has to wait for 11.  There's too much chance of collateral
damage.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

