Thread: postmaster / resolv.conf / DNS problem
I'm running 7.4.8 on RHEL 3.0 x86.

Today, on two separate servers, I modified the resolv.conf file to
point from two functioning name servers to two others.

Within 5 minutes, one server would not accept new remote connections.
I could log in fine w/ psql locally. All name servers involved were
working fine before, during, and after.

Postgres kept spawning additional postmaster processes; they'd run for
a few seconds/minutes, then die. There was nothing in the logs other
than routine queries / transaction log recycling.

I bounced postgres, and everything was fine. 45 minutes later, the
second server did the exact same thing.

Anyone ever seen this?

thanks!
Cott Lang <cott@internetstaff.com> writes:
> I'm running 7.4.8 on RHEL 3.0 x86.
> Today, on two separate servers, I modified the resolv.conf file to
> point from two functioning name servers to two others.
> Within 5 minutes, one server would not accept new remote connections.
> I could log in fine w/ psql locally.

This is pretty bizarre ... offhand I would not have thought that the
postmaster depended on DNS service at all.  Were you maybe using DNS
names instead of IP addresses in pg_hba.conf?  What exactly does
"would not accept" mean --- what was the exact error message, and was
there anything in the postmaster log?

			regards, tom lane
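For readers following along, the distinction Tom is asking about looks roughly like this in pg_hba.conf. These entries are made-up placeholders (documentation-range addresses), not taken from the poster's setup; the point is that address-based entries like these never require the postmaster to consult DNS at connection time:

```text
# TYPE  DATABASE  USER  CIDR-ADDRESS     METHOD
host    all       all   192.0.2.0/24     md5
host    all       all   198.51.100.7/32  trust
```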
> > Within 5 minutes, one server would not accept new remote connections.
> > I could log in fine w/ psql locally.
>
> This is pretty bizarre ... offhand I would not have thought that the
> postmaster depended on DNS service at all.  Were you maybe using DNS
> names instead of IP addresses in pg_hba.conf?  What exactly does
> "would not accept" mean --- what was the exact error message, and was
> there anything in the postmaster log?
I'm using only IP addresses in pg_hba.conf.
There was nothing in the postmaster log indicating a problem.
The only thing I saw strange was multiple postmasters spawning and disappearing.
The errors I got from the JDBC driver were connection-pool timeouts
while trying to get a connection, so it's possible connections were
working, just taking horribly long to complete. The timeout for Nagios
monitoring PG was 10 seconds; the pools' was 20 seconds. In three
years, I've probably seen 3 timeouts. :)
Cott Lang wrote:
>>> Within 5 minutes, one server would not accept new remote connections.
>>> I could log in fine w/ psql locally.
>>
>> This is pretty bizarre ... offhand I would not have thought that the
>> postmaster depended on DNS service at all.  Were you maybe using DNS
>> names instead of IP addresses in pg_hba.conf?  What exactly does
>> "would not accept" mean --- what was the exact error message,
>> and was there anything in the postmaster log?
>
> I'm using only IP addresses in pg_hba.conf.
>
> There was nothing in the postmaster log indicating a problem.
>
> The only thing I saw strange was multiple postmasters spawning and
> disappearing.
>
> The errors I got in the JDBC drivers was the connection pool timing
> out trying to get a connection, so it's possible they were working,
> just taking horribly long to connect. Timeouts for Nagios monitoring
> PG was 10 seconds; pools were 20 seconds. In three years, I've
> probably seen 3 time outs. :)

Could it be name-lookups for logging purposes? I've been caught out by
that elsewhere.

-- 
Richard Huxton
Archonet Ltd
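The logging-related lookup Richard is alluding to is controlled by a single postgresql.conf knob. An illustrative fragment (the value shown is the default, which avoids the lookup):

```text
# postgresql.conf -- when set to true, each new backend reverse-resolves
# the connecting client's IP address so the hostname can appear in the
# connection log.  That lookup requires a working resolver and will block
# backend startup if resolv.conf points at an unreachable name server.
log_hostname = false
```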
Cott Lang <cott@internetstaff.com> writes:
>> What exactly does
>> "would not accept" mean --- what was the exact error message,
>> and was there anything in the postmaster log?

> There was nothing in the postmaster log indicating a problem.

> The only thing I saw strange was multiple postmasters spawning and
> disappearing.

> The errors I got in the JDBC drivers was the connection pool timing
> out trying to get a connection, so it's possible they were working,
> just taking horribly long to connect. Timeouts for Nagios monitoring
> PG was 10 seconds; pools were 20 seconds.

In that case, the "multiple postmasters" were probably backends spawned
in response to JDBC connection attempts, which went away when they
noticed the client had disconnected.  These symptoms seem consistent
with the idea that backend startup was taking a real long time, which
given the context has to be blamed on a DNS lookup timing out.

Do you have log_hostname enabled?  If so, the backends would be trying
to do reverse lookups on the IP address of their connected client, and
we could explain all the facts with the assumption that that lookup
was encountering a 30-second-or-so timeout.

Why this should be happening after you change resolv.conf isn't real
clear to me, but in any case if you have a gripe about it you should
gripe to your libc or libbind supplier, not us.  Whatever the problem
is, it's down inside the getnameinfo() library routine.

			regards, tom lane
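Tom's getnameinfo() point can be sketched in a few lines of Python (which wraps the same libc routine). This is an illustrative sketch, not PostgreSQL's actual code; the address is a made-up stand-in for what the postmaster sees after accept():

```python
import socket

# Hypothetical client address, as seen by the server after accept().
addr = ("127.0.0.1", 5432)

# With log_hostname off, only the numeric form is needed.  NI_NUMERICHOST
# tells getnameinfo() to skip the reverse DNS lookup entirely, so this
# never touches the resolver and cannot block.
host, port = socket.getnameinfo(
    addr, socket.NI_NUMERICHOST | socket.NI_NUMERICSERV)
print(host, port)  # 127.0.0.1 5432

# With log_hostname on, the equivalent of this call runs during backend
# startup: without NI_NUMERICHOST, getnameinfo() attempts a reverse
# lookup, and if resolv.conf points at an unreachable name server it
# blocks until the resolver gives up -- tens of seconds with default
# retry settings, which matches the pool timeouts described above.
host, _ = socket.getnameinfo(addr, socket.NI_NUMERICSERV)
```

If resolver timeouts are the culprit, glibc's resolv.conf also accepts an `options timeout:n attempts:n` line to shorten the worst-case stall, though that mitigates the symptom rather than the root cause.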