Re: Website troubles - Mailing list pgsql-general

From Justin Clift
Subject Re: Website troubles
Date
Msg-id 3E38ADBF.6070003@postgresql.org
In response to Re: Website troubles  ("Marc G. Fournier" <scrappy@hub.org>)
List pgsql-general
Marc G. Fournier wrote:
> On Wed, 29 Jan 2003, Robert Treat wrote:
>
>>Well, maybe it does, but when an important news story drives new
>>eyeballs to your website, you need something better than a bouncing $hit
>>happens logo if you want to make a positive impression. All Greg wants
>>to know is what caused the problem and what steps are being taken to
>>make sure it doesn't happen again. That's hardly unreasonable.
>
>
> The problem is/was persistent database connections ... the problem, IMHO,
> is that there is no way of 'timing out' idle connections, so any load on
> the web site that creates a whack of persistent connections, and then they
> all go idle, then if another hit on a different database goes through, it
> gets starved for connections ...
>
> I've started to disable PHPs default of allowing persistent connections,
> which seems to have help'd ...

It seems appropriate to point out a couple of things about now.  The
extra hits from /. only doubled the traffic for a while (easily
handled), and the main traffic that hit the site was hitting the front
portal pages... static data - no PHP nor database connections involved
per connection.

The front portal pages are static .html pages that are generated hourly
from a few dynamic templates.  The reason for the original error
messages showing up is that all of the PHP connections (non-persistent
at the time) to the backend database were already used, and the main
portal page couldn't create a new database connection to one of the
databases to properly generate the pages.  Thus, it had a case of the
sads and spat out errors that were in turn frozen into the newly
generated static pages (oh dear).

Once we'd realised (thanks to the people that emailed us about this), we
changed some things so the errors weren't frozen into the static pages
any more and fired off an email to the database admin guys so they could
bump up the max_connections parameter or restart Apache so that the
persistent connections would all be re-established properly.

Here's where the human failure problem kicked in: the majority of the
database admin guys had driven about 6 hours to get to the Open Source
Weekend expo in Canada where PostgreSQL was being presented, and the guy
left behind to cover emergency issues was sick.  Not "sick and didn't
come in to work" mind you, but "sick and the medication the doctor gave
him knocked him out cold mid-keystroke for about 18 hours".  Not just
your average case of the flu.  :(

We probably need to think of some way to automatically fail gracefully if
the same kind of thing happens again in the future, as it's not a load
bearing problem, just a configuration + human combination.  But... that
doesn't mean it's impossible to happen again.
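
As a rough illustration of what "fail gracefully" could mean for the
hourly page generation, here is a minimal sketch in Python (not what the
site actually runs).  The names regenerate_page and render are
hypothetical, and it assumes the renderer raises an exception when it
can't get a database connection.  The idea is simply to keep the last
good page instead of publishing an error page:

```python
import os
import tempfile


def regenerate_page(render, target_path):
    """Rebuild one static page; keep the old copy if rendering fails.

    `render` is a hypothetical callable that returns the page HTML and
    raises an exception when, say, no database connection is available.
    """
    try:
        html = render()
    except Exception as exc:
        # Leave the last good page in place rather than publishing an error.
        print(f"keeping previous {target_path}: {exc}")
        return False

    # Write to a temporary file in the same directory, then swap it in
    # with a rename so readers never see a half-written page.
    directory = os.path.dirname(os.path.abspath(target_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(html)
        # mkstemp creates the file owner-only; make it readable by the
        # web server before it goes live (assumed requirement).
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, target_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True
```

With something like this, a failed run leaves visitors looking at a page
that is an hour or so stale, which is a lot better than a frozen error
message.  It wouldn't tell anyone that something is wrong though, so it
would still need some kind of alert alongside it.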

Regards and best wishes,

Justin Clift

--
"My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there."
- Indira Gandhi

