Re: 答复: [GENERAL] About could not connect to server: Connection timed out - Mailing list pgsql-general

From Craig Ringer
Subject Re: 答复: [GENERAL] About could not connect to server: Connection timed out
Date
Msg-id 4ADD9675.8000006@postnewspapers.com.au
List pgsql-general
On 20/10/2009 3:01 PM, 黄永卫 wrote:
> Thanks for your reply!
> The server and the client connect to the same Cisco switch.

OK, so they're both on the same local network segment, with the same
subnet and IP address range, connected via a single Ethernet switch?
Guess it's probably not the network.

> This issue always occurs after we reboot the server, during the first several
> minutes after the postgres service becomes ready.

Hang on. You reboot the server? Why?

Anyway, let me make sure I understand what you are saying. After you
reboot the server, just after the PostgreSQL service has started up,
there are several minutes where some (but not all) client connections
tend to time out. After that initial problem period, things start to
work properly again and the time-outs stop happening. You only have
problems shortly after PostgreSQL (and usually the whole server) has
been re-started.

Right?

If so: Could it just be that a rush of reconnecting clients as the
server comes up causes so much load that it can't process all the
requests before some clients give up? The server would then find, when
it got around to answering the client, that the client had since closed
the connection, which would result in the errors you see in the log.

Try keeping an eye on the number of connected clients, the server load,
and the server response time just after it starts up. I'll bet you'll
see disk I/O and/or CPU load max out, and find that connections to other
services on the same server (say: remote desktop, ssh, file sharing, etc)
are also very slow or time out. Do *not* just ping the server; that'll
usually remain nearly instant no matter how overloaded the server is.
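
As a rough sketch of that kind of check (not something from the original
report), the Python below times a TCP connection to the PostgreSQL port
instead of sending a ping. The function name `time_tcp_connect` is
hypothetical, and 5432 is only PostgreSQL's default port. Note that it
measures just the TCP handshake, which the kernel completes before
PostgreSQL accepts the connection, so it mainly reveals dropped or
delayed connection attempts; timing a trivial query through a real
client library would say more about the server's actual response time.

```python
import socket
import time


def time_tcp_connect(host, port=5432, timeout=5.0):
    """Return seconds taken to open a TCP connection, or None on failure.

    Failure covers timeouts, refused connections and unreachable hosts,
    all of which Python reports as OSError subclasses.
    """
    start = time.monotonic()
    try:
        # The connection is closed again as soon as the handshake succeeds.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None
```

Calling something like `time_tcp_connect("db.example.com")` (a made-up
host name) once a second from a client machine while the server boots,
and logging the results, should show whether the failures line up with
the start-up period.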

If the problem does turn out to be the server being overloaded: Perhaps
you should rate-limit client reconnection attempts? A common technique
is to have clients retry connections after a random delay.
That way, rather than a "thundering herd" of clients all connecting at
once, they connect at random intervals over a short period after the
server comes back up, so the server only has to process a few connection
attempts at once. It's also often a good idea to have that random delay
start out quite short, and increase a bit over time.
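
A minimal sketch of such a retry loop in Python, assuming the client
can wrap its connection attempt in a zero-argument callable. The name
`connect_with_backoff` and all of the default values are illustrative
only. Each delay is drawn at random between zero and an upper bound
that doubles after every failure, up to a cap.

```python
import random
import time


def connect_with_backoff(connect, attempts=8, base=0.5, cap=30.0,
                         retry_on=(OSError,), sleep=time.sleep,
                         rng=random.uniform):
    """Call connect() until it succeeds, waiting a random delay between tries.

    connect  -- zero-argument callable that returns a connection or raises
    attempts -- total number of tries before giving up
    base/cap -- first and largest upper bound for the random delay (seconds)
    retry_on -- exception types that should trigger a retry
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return connect()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of tries: let the caller see the last error
            # The upper bound starts short and doubles on each failure.
            upper = min(cap, base * 2 ** attempt)
            # A random point in [0, upper] spreads clients out in time.
            sleep(rng(0.0, upper))
```

With a driver such as psycopg2, `connect` would presumably be a small
lambda around `psycopg2.connect(...)` and `retry_on` would be
`(psycopg2.OperationalError,)`; that pairing is an assumption about the
client, not something stated in this thread.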

A search for the "thundering herd problem" will tell you a bit more
about this, though not in PostgreSQL specific terms.

> Is it possible that the server's performance is causing the issue (the server
> is too busy at that moment)?

Highly likely given the additional information you've now provided.

--
Craig Ringer
