Thread: Ident authentication fails due to bind error on server (8.4.8)
Hi,

I'm not sure that this is not a configuration or networking issue (so apologies if it is), but we seem to be getting rare (a few times/day) failures with ident authentication because several clients attempt to do it simultaneously over a high-latency connection (capitalized = edited IPs/username etc.):

[DB CLIENTADDR(51985) 3173 2011-06-17 10:49:56 CEST] LOG: could not bind to local address "SERVERADDR": Address already in use
[DB CLIENTADDR(51985) 3173 2011-06-17 10:49:56 CEST] FATAL: Ident authentication failed for user "USER"
[DB CLIENTADDR(51986) 3183 2011-06-17 10:49:56 CEST] FATAL: no pg_hba.conf entry for host "CLIENTADDR", user "USER", database "DB", SSL off

On the client side, we had 2 connection attempts, of which 1 failed (apparently):

Jun 17 10:49:53 xxx oidentd[12377]: Connection from SERVER (SERVERADDR):0
Jun 17 10:49:53 xxx oidentd[12377]: [SERVER] Successful lookup: 51980 , 5432 : crm (crm)
[Fri Jun 17 10:49:53 2011] [error] [client 127.0.0.1] [Fri Jun 17 10:49:53 2011] kv_tpl.pl: DBI connect('dbname=DB;host=SERVER','USER',...) failed: FATAL: Ident authentication failed for user "USER", referer: URL
[Fri Jun 17 10:49:53 2011] [error] [client 127.0.0.1] [Fri Jun 17 10:49:53 2011] kv_tpl.pl: FATAL: no pg_hba.conf entry for host "CLIENTADDR", user "USER", database "DB", SSL off at /var/www/crm/kv_tpl.pl line 100, referer: URL

Is this a possible race condition in src/backend/libpq/auth.c?

[Note: the client/server clocks are 3 seconds apart at this point; I haven't investigated whether that causes issues here.]

---
    /*
     * Bind to the address which the client originally contacted, otherwise
     * the ident server won't be able to match up the right connection. This
     * is necessary if the PostgreSQL server is running on an IP alias.
     */
    rc = bind(sock_fd, la->ai_addr, la->ai_addrlen);
    if (rc != 0)
    {
        ereport(LOG,
                (errcode_for_socket_access(),
                 errmsg("could not bind to local address \"%s\": %m",
                        local_addr_s)));
        ident_return = false;
        goto ident_inet_done;
    }
---

Regards,
Marinos

"Marinos Yannikos" <mjy@geizhals.at> writes:
> I'm not sure that this is not a configuration or networking issue (so
> apologies if it is), but we seem to be getting rare (a few times/day)
> failures with ident authentication because several clients attempt to do
> it simultaneously over a high-latency connection (capitalized = edited
> IPs/username etc.):

> [DB CLIENTADDR(51985) 3173 2011-06-17 10:49:56 CEST] LOG: could not bind
> to local address "SERVERADDR": Address already in use
> [DB CLIENTADDR(51985) 3173 2011-06-17 10:49:56 CEST] FATAL: Ident
> authentication failed for user "USER"

Hm. What platform is this on?

> Is this a possible race condition in src/backend/libpq/auth.c ?

I don't think it's a race condition per se. The code ought to be setting up the address argument for bind() with sin_port = 0 so that an unused port number gets assigned. That seems to be what happens on a couple of machines that I tried here, but I notice that the Linux manpage for getaddrinfo says

    service sets the port in each returned address structure. If
    this argument is a service name (see services(5)), it is
    translated to the corresponding port number. This argument can
    also be specified as a decimal number, which is simply converted
    to binary. If service is NULL, then the port number of the
    returned socket addresses will be left uninitialized.

In principle this wording would allow getaddrinfo to return the same nonzero port number in multiple backends, which would lead to the reported failure if they were doing ident verification at the same time. I'm thinking maybe we should explicitly pass "0" rather than NULL to getaddrinfo here. On the other hand, it seems to work reliably as-is on my Linux machine, so this is just speculation at this point.

(BTW, is it really sane to be using ident auth over a "high latency connection"? That would certainly suggest to me that you could be getting connections from untrustworthy machines ...)

regards, tom lane

I wrote:
> I don't think it's a race condition per se. The code ought to be
> setting up the address argument for bind() with sin_port = 0 so that
> an unused port number gets assigned. That seems to be what happens on
> a couple of machines that I tried here, but I notice that the Linux
> manpage for getaddrinfo says

>     service sets the port in each returned address structure. If
>     this argument is a service name (see services(5)), it is
>     translated to the corresponding port number. This argument can
>     also be specified as a decimal number, which is simply converted
>     to binary. If service is NULL, then the port number of the
>     returned socket addresses will be left uninitialized.

> In principle this wording would allow getaddrinfo to return the same
> nonzero port number in multiple backends, which would lead to the
> reported failure if they were doing ident verification at the same time.
> I'm thinking maybe we should explicitly pass "0" rather than NULL to
> getaddrinfo here. On the other hand, it seems to work reliably as-is
> on my Linux machine, so this is just speculation at this point.

I looked at the glibc source code for getaddrinfo, and it looks like they do reliably set sin_port to zero when no service argument is provided, despite the above documentation statement. So that's why it works for me. But still, if you're on a non-Linux platform it seems possible that this is the mechanism for what's biting you.

regards, tom lane

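The suggestion of passing "0" rather than NULL to getaddrinfo can be sketched as follows. This is only an illustration under stated assumptions, not the code in auth.c: the helper name bind_any_port and its out-parameter are hypothetical, and it handles IPv4 numeric addresses only. It resolves the local address with an explicit service of "0", binds, and then asks the kernel via getsockname() which port was actually assigned.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not PostgreSQL code): resolve a numeric IPv4 address
 * with an explicit service of "0", bind a TCP socket to it, and return the
 * socket descriptor. The port chosen by the kernel is stored in *port_out
 * (host byte order). Returns -1 on any failure.
 */
int
bind_any_port(const char *local_addr, unsigned short *port_out)
{
    struct addrinfo hints;
    struct addrinfo *la = NULL;
    struct sockaddr_in bound;
    socklen_t   len = sizeof(bound);
    int         sock_fd;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = AI_NUMERICHOST | AI_NUMERICSERV;

    /* Explicit "0" instead of NULL, so the port field is defined as zero. */
    if (getaddrinfo(local_addr, "0", &hints, &la) != 0)
        return -1;
    assert(la != NULL);
    assert(((struct sockaddr_in *) la->ai_addr)->sin_port == 0);

    sock_fd = socket(la->ai_family, la->ai_socktype, la->ai_protocol);
    if (sock_fd < 0)
    {
        freeaddrinfo(la);
        return -1;
    }

    /* Port 0 asks the kernel to pick an unused ephemeral port. */
    if (bind(sock_fd, la->ai_addr, la->ai_addrlen) != 0 ||
        getsockname(sock_fd, (struct sockaddr *) &bound, &len) != 0)
    {
        close(sock_fd);
        freeaddrinfo(la);
        return -1;
    }
    freeaddrinfo(la);

    *port_out = ntohs(bound.sin_port);
    return sock_fd;
}
```

Calling bind_any_port("127.0.0.1", &port) twice while both sockets stay open should yield two different nonzero ports, which is the behaviour the ident code relies on when several backends authenticate at the same time.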
On Fri, 17 Jun 2011 19:51:59 +0200, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I looked at the glibc source code for getaddrinfo, and it looks like
> they do reliably set sin_port to zero when no service argument is
> provided, despite the above documentation statement. So that's why it
> works for me. But still, if you're on a non-Linux platform it seems
> possible that this is the mechanism for what's biting you.

Both client and server are Linux systems here, and sin_port is also 0 according to debug output I added. I cannot reproduce the problem reliably (the users are much better testers, it seems), so I'm a bit stuck, with my best guess being TIME_WAIT issues, perhaps FIN packets getting lost. I've now set

sysctl -w net.ipv4.tcp_tw_reuse=1

and will post again if there is any change.

> (BTW, is it really sane to be using ident auth over a "high latency
> connection"? That would certainly suggest to me that you could be
> getting connections from untrustworthy machines ...)

Both endpoints are properly firewalled (the sane sysadmins say so), and for this particular connection only one client IP address is allowed by pg_hba.conf. The reason we also use ident authentication is to allow only a few select UIDs on the client host to connect to certain DSNs.

Thanks for all the helpful info!

Regards,
Marinos

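A debug check of the kind mentioned above might look roughly like this. It is a sketch only: the function name resolved_port is hypothetical and this is not the debug output that was actually added. It resolves a numeric address with a given service argument (which may be NULL) and reports the port that getaddrinfo placed in the first returned address.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical debugging helper (not part of PostgreSQL): resolve a numeric
 * host with the given service (may be NULL) and return the port found in the
 * first result, in host byte order. Returns -1 on failure.
 */
int
resolved_port(const char *host, const char *service)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;
    int         port = -1;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = AI_NUMERICHOST;    /* no DNS lookups, literals only */

    if (getaddrinfo(host, service, &hints, &res) != 0)
        return -1;
    assert(res != NULL);

    /* The port lives in a different field depending on address family. */
    if (res->ai_family == AF_INET)
        port = ntohs(((struct sockaddr_in *) res->ai_addr)->sin_port);
    else if (res->ai_family == AF_INET6)
        port = ntohs(((struct sockaddr_in6 *) res->ai_addr)->sin6_port);

    freeaddrinfo(res);
    return port;
}
```

With an explicit service, resolved_port("127.0.0.1", "0") returns 0 and resolved_port("127.0.0.1", "5432") returns 5432. Passing NULL as the service shows what the local C library does in the case the manpage describes as uninitialized.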
On Sat, 18 Jun 2011 04:55:59 +0200, Marinos Yannikos <mjy@geizhals.at> wrote:
> sysctl -w net.ipv4.tcp_tw_reuse=1

This apparently fixed the issue, so it seems that bind() with sin_port=0 was, for some reason, choosing ports still in TIME_WAIT state, and that caused the failures.

Regards,
Marinos
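To confirm from a program that the sysctl change is in effect, the value can be read back from procfs. The following is a minimal sketch, assuming a Linux system where net.ipv4.tcp_tw_reuse is exposed as /proc/sys/net/ipv4/tcp_tw_reuse; the helper name read_int_setting is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: read a single integer from a file such as
 * /proc/sys/net/ipv4/tcp_tw_reuse. Returns -1 if the file cannot be
 * opened or does not start with an integer.
 */
int
read_int_setting(const char *path)
{
    FILE       *fp;
    int         value;

    assert(path != NULL);

    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    if (fscanf(fp, "%d", &value) != 1)
        value = -1;

    fclose(fp);
    return value;
}
```

After running the sysctl command quoted above, read_int_setting("/proc/sys/net/ipv4/tcp_tw_reuse") would be expected to return 1 on the server.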