Thread: postmaster / resolv.conf / dns problem

postmaster / resolv.conf / dns problem

From: Cott Lang

I'm running 7.4.8 on RHEL 3.0 x86.

Today, on two separate servers, I modified the resolv.conf file to point
from two functioning name servers to two others.

Within 5 minutes, one server would not accept new remote connections. I
could log in fine w/ psql locally.

All name servers involved were working fine before, during, and after.

Postgres kept spawning additional postmaster processes, they'd run for a
few seconds/minutes, then die. There was nothing in the logs other than
routine queries / transaction log recycling.

I bounced postgres, and everything is fine.

45 minutes later, the second server did the exact same thing.

Anyone ever seen this?

thanks!



Re: postmaster / resolv.conf / dns problem

From: Tom Lane

Cott Lang <cott@internetstaff.com> writes:
> I'm running 7.4.8 on RHEL 3.0 x86.
> Today, on two separate servers, I modified the resolv.conf file to point
> from two functioning name servers to two others.

> Within 5 minutes, one server would not accept new remote connections. I
> could log in fine w/ psql locally.

This is pretty bizarre ... offhand I would not have thought that the
postmaster depended on DNS service at all.  Were you maybe using DNS
names instead of IP addresses in pg_hba.conf?  What exactly does
"would not accept" mean --- what was the exact error message,
and was there anything in the postmaster log?

            regards, tom lane

Re: postmaster / resolv.conf / dns problem

From: Cott Lang

>> Within 5 minutes, one server would not accept new remote connections. I
>> could log in fine w/ psql locally.

> This is pretty bizarre ... offhand I would not have thought that the
> postmaster depended on DNS service at all.  Were you maybe using DNS
> names instead of IP addresses in pg_hba.conf?  What exactly does
> "would not accept" mean --- what was the exact error message,
> and was there anything in the postmaster log?

I'm using only IP addresses in pg_hba.conf.

There was nothing in the postmaster log indicating a problem.

The only thing I saw strange was multiple postmasters spawning and disappearing.

The errors I got in the JDBC drivers were the connection pool timing out
trying to get a connection, so it's possible they were working, just
taking horribly long to connect.  Timeouts for Nagios monitoring PG were
10 seconds; pools were 20 seconds. In three years, I've probably seen 3
timeouts. :)

Re: postmaster / resolv.conf / dns problem

From: Richard Huxton

Cott Lang wrote:
>>> Within 5 minutes, one server would not accept new remote connections. I
>>> could log in fine w/ psql locally.
>> This is pretty bizarre ... offhand I would not have thought that the
>> postmaster depended on DNS service at all.  Were you maybe using DNS
>> names instead of IP addresses in pg_hba.conf?  What exactly does
>> "would not accept" mean --- what was the exact error message,
>> and was there anything in the postmaster log?
>
>
> I'm using only IP addresses in pg_hba.conf.
>
> There was nothing in the postmaster log indicating a problem.
>
> The only thing I saw strange was multiple postmasters spawning and
> disappearing.
>
> The errors I got in the JDBC drivers were the connection pool timing out
> trying to get a connection, so it's possible they were working, just
> taking horribly long to connect.  Timeouts for Nagios monitoring PG were
> 10 seconds; pools were 20 seconds. In three years, I've probably seen 3
> timeouts. :)

Could it be name-lookups for logging purposes? I've been caught out by
that elsewhere.

--
   Richard Huxton
   Archonet Ltd

Re: postmaster / resolv.conf / dns problem

From: Tom Lane

Cott Lang <cott@internetstaff.com> writes:
>> What exactly does
>> "would not accept" mean --- what was the exact error message,
>> and was there anything in the postmaster log?

> There was nothing in the postmaster log indicating a problem.

> The only thing I saw strange was multiple postmasters spawning and
> disappearing.

> The errors I got in the JDBC drivers were the connection pool timing out
> trying to get a connection, so it's possible they were working, just
> taking horribly long to connect.  Timeouts for Nagios monitoring PG were
> 10 seconds; pools were 20 seconds.

In that case, the "multiple postmasters" were probably backends spawned
in response to JDBC connection attempts, which went away when they
noticed the client had disconnected.  These symptoms seem consistent
with the idea that backend startup was taking a real long time, which
given the context has to be blamed on a DNS lookup timing out.  Do you
have log_hostname enabled?  If so, the backends would be trying to do
reverse lookups on the IP address of their connected client, and we
could explain all the facts with the assumption that that lookup was
encountering a 30-second-or-so timeout.

Why this should be happening after you change resolv.conf isn't real
clear to me, but in any case if you have a gripe about it you should
gripe to your libc or libbind supplier, not us.  Whatever the problem
is, it's down inside the getnameinfo() library routine.

            regards, tom lane
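
One way to test the reverse-lookup theory from the last message is to time the same kind of lookup from the database host. The sketch below is only an illustration, not anything the server itself runs: the function name time_reverse_lookup and the default port 5432 are made up for the example, and it assumes Python's socket.getnameinfo(), which wraps the C library's getnameinfo(), is a fair stand-in for what a backend would do with log_hostname enabled.

```python
import socket
import time


def time_reverse_lookup(ip, port=5432, resolver=socket.getnameinfo):
    """Reverse-resolve ip and return (hostname_or_None, elapsed_seconds).

    `resolver` defaults to socket.getnameinfo; it is a parameter only so
    that the timing logic can be exercised without touching real DNS.
    """
    # NI_NAMEREQD: fail rather than fall back to the numeric address.
    # NI_NUMERICSERV: skip the service-name lookup for the port.
    flags = socket.NI_NAMEREQD | socket.NI_NUMERICSERV
    start = time.monotonic()
    try:
        host, _service = resolver((ip, port), flags)
    except OSError:
        # socket.gaierror is a subclass of OSError; treat it as "no name".
        host = None
    elapsed = time.monotonic() - start
    return host, elapsed
```

Calling time_reverse_lookup() with the address of one of the JDBC clients, from the database host, should show whether the lookup returns promptly or sits near the resolver timeout. If it does stall and host names in the log are not needed, turning log_hostname off in postgresql.conf should avoid the lookup altogether.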