Thread: Website troubles

Website troubles

From
"Greg Sabino Mullane"
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Was it ever resolved exactly what happened to the website last
weekend? Was there a reason it went for such a long time
without being fixed? Is there a phone number or something
one can use in case the webmaster(s) are not monitoring the
lists?

I know more or less *what* happened (php connections filled up due
to using persistent connections) but not *why*. Also, would it not
be a good idea to have the main page (index.html) be static only,
to prevent things like this happening again? New events and news
items could regenerate a static page as needed, perhaps on
an hourly cron schedule. I see no reason for every access to the
main page to query a news/events database for a dynamic page.

IMO, this is yet another argument to "open-source" the website code:
make it available by CVS, solicit feedback on a mailing list,
and allow people to submit patches.

- --
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200301290915

-----BEGIN PGP SIGNATURE-----
Comment: http://www.turnstep.com/pgp.html

iD8DBQE+N+KyvJuQZxSWSsgRAoXvAJ45DG7scsu3D30Dd3GX2TMghP1hUQCffWMY
Lls5IO+S/21KYhFFJ4F8yxw=
=ZVDD
-----END PGP SIGNATURE-----



Re: Website troubles

From
Tony Grant
Date:
On Wed, 2003-01-29 at 09:19, Greg Sabino Mullane wrote:

> Was it ever resolved exactly what happened to the website last
> weekend?

Read a little more news my friend!

The whole internet was paralysed by a worm trying to bring down all the
SQL servers of the earth.

Sh§t happens...

Tony Grant

--
www.tgds.net Library management software toolkit,
redhat linux on Sony Vaio C1XD,
Dreamweaver MX with Tomcat and PostgreSQL


Re: Website troubles

From
Tony Grant
Date:
On Wed, 2003-01-29 at 11:16, Rogier van Eeten wrote:

> > > Was it ever resolved exactly what happened to the website last
> > > weekend?
> >
> > Read a little more news my friend!
> >
> > The whole internet was paralysed by a worm trying to bring down all the
> > SQL servers of the earth.
>
> Uhm... wasn't that a mssql-worm. And the patch was out for about half a
> year. So any administrator with a broken mssql wasn't quite good in his
> job. And I sincerely hope that the postgresql mailinglist wasn't running
> from a machine with mssql...

You are right it is a MS-SQL thing but the packets flooding the internet
are just that - packets.

Even my other half asked me why it was taking her so long to connect to
the mail server! Check the traffic reports for the weekend.
theregister.co.uk was timing out on me for most of the weekend - they
run Debian...

Cheers

Tony

--
www.tgds.net Library management software toolkit,
redhat linux on Sony Vaio C1XD,
Dreamweaver MX with Tomcat and PostgreSQL


Re: Website troubles

From
greg@turnstep.com
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


>> Was it ever resolved exactly what happened to the website last
>> weekend?
>
> Read a little more news my friend!
>
> The whole internet was paralysed by a worm trying to bring down
> all the SQL servers of the earth.

I was well aware of the news, and saying "the whole internet" was paralyzed
is a bit dramatic. I was interested in learning how a worm that propagates
on a port used by a Microsoft database managed to affect the website
so that no free php connections were available.

I am also interested in why it took so long for it to be resolved, and I
would like to explore the idea of using dynamic pages only when absolutely
necessary.

> Sh-t happens...

Thanks for the insight.

- --
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200301291407

-----BEGIN PGP SIGNATURE-----
Comment: http://www.turnstep.com/pgp.html

iD8DBQE+OCbEvJuQZxSWSsgRAtA6AJ9+FgYZ5MQKcEBoR5pnNaHY94YETwCfY/Kj
qJGIPZezaAu0MEgdiBlxRyE=
=Ckjs
-----END PGP SIGNATURE-----



Re: Website troubles

From
Robert Treat
Date:
On Wed, 2003-01-29 at 14:11, greg@turnstep.com wrote:
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
>
> >> Was it ever resolved exactly what happened to the website last
> >> weekend?
> >
> > Read a little more news my friend!
> >
> > The whole internet was paralysed by a worm trying to bring down
> > all the SQL servers of the earth.
>
> I was well aware of the news, and saying "the whole internet" was paralyzed
> is a bit dramatic. I was interested in learning how a worm that propagates
> on a port used by a Microsoft database managed to affect the website
> so that no free php connections were available.
>

Yes, that's absolute nonsense. A far more likely scenario was the fact
that Friday afternoon slashdot posted an article about the .org cutover
which started generating more traffic as people wanted to find out more
about postgresql.  If you read through the slashdot replies you'll note
someone posting early Saturday morning that he is getting the php errors. Go
Occam.

> I am also interested in why it took so long for it to be resolved, and I
> would like to explore the idea of using dynamic pages only when absolutely
> necessary.
>

I think these are all valid questions that I hope to see addressed.

> > Sh-t happens...
>
> Thanks for the insight.
>

Well, maybe it does, but when an important news story drives new
eyeballs to your website, you need something better than a bouncing $hit
happens logo if you want to make a positive impression. All Greg wants
to know is what caused the problem and what steps are being taken to
make sure it doesn't happen again. That's hardly unreasonable.

Robert Treat


Re: Website troubles

From
"Marc G. Fournier"
Date:
On Wed, 29 Jan 2003, Robert Treat wrote:

> Well, maybe it does, but when an important news story drives new
> eyeballs to your website, you need something better than a bouncing $hit
> happens logo if you want to make a positive impression. All Greg wants
> to know is what caused the problem and what steps are being taken to
> make sure it doesn't happen again. That's hardly unreasonable.

The problem is/was persistent database connections ... the problem, IMHO,
is that there is no way of 'timing out' idle connections, so any load on
the web site that creates a whack of persistent connections, and then they
all go idle, then if another hit on a different database goes through, it
gets starved for connections ...

I've started to disable PHP's default of allowing persistent connections,
which seems to have helped ...


Re: Website troubles

From
Neil Conway
Date:
On Wed, 2003-01-29 at 21:52, Marc G. Fournier wrote:
> The problem is/was persistent database connections ... the problem, IMHO,
> is that there is no way of 'timing out' idle connections, so any load on
> the web site that creates a whack of persistent connections, and then they
> all go idle, then if another hit on a different database goes through, it
> gets starved for connections ...

Couldn't that easily be handled by the client interface (PHP, in this
case) that provides support for persistent connections?

(Assuming that you're suggesting that we add support for timing out
sessions to the backend -- if you're not, my apologies.)
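
As a rough illustration of what client-side handling could look like, here
is a minimal Python sketch of a pool that closes connections left idle past
a timeout. The class name, the `connect` factory and the injectable `clock`
are all hypothetical; this is not how PHP's persistent connections are
implemented, only one way a client interface might time out idle sessions.

```python
import time


class IdleTimeoutPool:
    """Hypothetical client-side pool that drops connections idle too long."""

    def __init__(self, connect, idle_timeout, clock=time.monotonic):
        self._connect = connect            # callable returning a new connection
        self._idle_timeout = idle_timeout  # seconds a connection may sit idle
        self._clock = clock                # injectable so it can be tested
        self._idle = []                    # (connection, time it was released)

    def acquire(self):
        """Reuse a fresh idle connection if one exists, else open a new one."""
        self.reap()
        if self._idle:
            conn, _released_at = self._idle.pop()
            return conn
        return self._connect()

    def release(self, conn):
        """Hand a connection back and remember when it went idle."""
        self._idle.append((conn, self._clock()))

    def reap(self):
        """Close every connection that has been idle for idle_timeout or more."""
        now = self._clock()
        still_fresh = []
        for conn, released_at in self._idle:
            if now - released_at >= self._idle_timeout:
                conn.close()  # frees the backend slot on the server
            else:
                still_fresh.append((conn, released_at))
        self._idle = still_fresh
```

In this sketch reaping only happens when `acquire()` is called, so a process
that receives no further requests would still hold its connections; a real
implementation would presumably also call `reap()` from a timer.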

Cheers,

Neil
--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC




Re: Website troubles

From
Justin Clift
Date:
Marc G. Fournier wrote:
> On Wed, 29 Jan 2003, Robert Treat wrote:
>
>>Well, maybe it does, but when an important news story drives new
>>eyeballs to your website, you need something better than a bouncing $hit
>>happens logo if you want to make a positive impression. All Greg wants
>>to know is what caused the problem and what steps are being taken to
>>make sure it doesn't happen again. That's hardly unreasonable.
>
>
> The problem is/was persistent database connections ... the problem, IMHO,
> is that there is no way of 'timing out' idle connections, so any load on
> the web site that creates a whack of persistent connections, and then they
> all go idle, then if another hit on a different database goes through, it
> gets starved for connections ...
>
> I've started to disable PHP's default of allowing persistent connections,
> which seems to have helped ...

It seems appropriate to point out a couple of things about now.  The
extra hits from /. only doubled the traffic for a while (easily
handled), and the main traffic that hit the site was hitting the front
portal pages... static data - no PHP nor database connections involved
per connection.

The front portal pages are static .html pages that are generated hourly
from a few dynamic templates.  The reason for the original error
messages showing up is that all of the PHP connections (non persistent
at the time) to the backend database were already used, and the main
portal page couldn't create a new database connection to one of the
databases to properly generate the pages.  Thus, it had a case of the
sads and spat out errors that were in turn frozen into the newly
generated static pages (oh dear).

Once we'd realised (thanks to the people that emailed us about this), we
changed some things so the errors weren't frozen into the static pages
any more and fired off an email to the database admin guys so they could
bump up the max_connections parameter or restart Apache so that the
persistent connections would all be re-established properly.
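
A minimal sketch of that kind of safeguard, in Python rather than the
site's actual PHP, and not the code that was deployed: the page is rendered
first, and the published file is only replaced when rendering succeeds. The
function name and the `render` callable are assumptions made up for the
example.

```python
import os
import tempfile


def regenerate_static_page(render, target_path):
    """Replace target_path with render() output only if rendering succeeds.

    Returns True when the page was replaced and False when the previous
    page was kept because render() raised.
    """
    try:
        html = render()  # hypothetical callable; may raise if the DB is down
    except Exception:
        return False     # keep serving the last good copy

    directory = os.path.dirname(os.path.abspath(target_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(html)
        # Rename within one directory, so readers never see a partial file.
        os.replace(tmp_path, target_path)
    except OSError:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True
```

An hourly job could call this for each portal page and raise an alert when
it returns False, so a stale page is noticed instead of a broken one being
published.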

Here's where the human failure problem kicked in: the majority of the
database admin guys had driven about 6 hours to get to the Open Source
Weekend expo in Canada where PostgreSQL was being presented, and the guy
left behind to cover emergency issues was sick.  Not "sick and didn't
come in to work" mind you, but "sick and the medication the doctor gave
him knocked him out cold mid-keystroke for about 18 hours".  Not just
your average case of the flu.  :(

We probably need to think of some way to automatically fail gracefully if
the same kind of thing happens again in the future, as it's not a load
bearing problem, just a configuration + human combination.  But... that
doesn't mean it's impossible to happen again.

Regards and best wishes,

Justin Clift

--
"My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there."
- Indira Gandhi


Re: Website troubles

From
Lincoln Yeoh
Date:
At 10:52 PM 1/29/03 -0400, Marc G. Fournier wrote:

>The problem is/was persistent database connections ... the problem, IMHO,
>is that there is no way of 'timing out' idle connections, so any load on
>the web site that creates a whack of persistent connections, and then they
>all go idle, then if another hit on a different database goes through, it
>gets starved for connections ...

Why does it get starved for connections if there are idle ones? Why can't
the idle ones connect to a different DB?

Also since the pages showed most of the usual info along with the error
messages, I'd assume that being unable to connect to those databases isn't
such a serious problem, in which case the webapp shouldn't have to display
such ugliness to the user and just show as much of the usual info as
possible and send the errors out of band - to the system logs or such.
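
To make that concrete, here is a small Python sketch (the real site uses
PHP, and every name here is invented for illustration). Each section of the
page is rendered independently; a section whose data cannot be fetched is
left out and the error goes to the log instead of into the page.

```python
import html
import logging

logger = logging.getLogger("portal")  # hypothetical logger name


def render_section(title, fetch_rows):
    """Render one page section, or an empty string if its data is unavailable."""
    try:
        rows = fetch_rows()  # hypothetical callable that queries a database
    except Exception:
        # Out-of-band reporting: full traceback to the log, nothing to the user.
        logger.exception("could not load section %r", title)
        return ""
    items = "".join(f"<li>{html.escape(str(row))}</li>" for row in rows)
    return f"<h2>{html.escape(title)}</h2><ul>{items}</ul>"


def render_page(sections):
    """Join whichever sections rendered; sections is a list of (title, fetch)."""
    return "".join(render_section(title, fetch) for title, fetch in sections)
```

With this shape, a failure to reach one database costs the reader one block
of content rather than filling the page with error text.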

In some cases one could make sure the dynamic content webserver's max
connection setting is < Postgresql's max backends. The static content
webserver(s) could have a much higher max connection setting.

Regards,
Link.


Re: Website troubles

From
"scott.marlowe"
Date:
On Wed, 29 Jan 2003, Marc G. Fournier wrote:

> On Wed, 29 Jan 2003, Robert Treat wrote:
>
> > Well, maybe it does, but when an important news story drives new
> > eyeballs to your website, you need something better than a bouncing $hit
> > happens logo if you want to make a positive impression. All Greg wants
> > to know is what caused the problem and what steps are being taken to
> > make sure it doesn't happen again. That's hardly unreasonable.
>
> The problem is/was persistent database connections ... the problem, IMHO,
> is that there is no way of 'timing out' idle connections, so any load on
> the web site that creates a whack of persistent connections, and then they
> all go idle, then if another hit on a different database goes through, it
> gets starved for connections ...
>
> I've started to disable PHP's default of allowing persistent connections,
> which seems to have helped ...

I've posted on this before once or twice.  Basically, whatever Apache's
max children is set to, postgresql needs to be set for a higher number of
connections.  Since apache defaults to a much higher number, it's a
problem looking to happen.

If you drop the max apache children to say 64 and crank the max
connections on pgsql to 128 or so, it'll work fine.


Re: Website troubles

From
"scott.marlowe"
Date:
On Thu, 30 Jan 2003, Lincoln Yeoh wrote:

> At 10:52 PM 1/29/03 -0400, Marc G. Fournier wrote:
>
> >The problem is/was persistent database connections ... the problem, IMHO,
> >is that there is no way of 'timing out' idle connections, so any load on
> >the web site that creates a whack of persistent connections, and then they
> >all go idle, then if another hit on a different database goes through, it
> >gets starved for connections ...
>
> Why does it get starved for connections if there are idle ones? Why can't
> the idle ones connect to a different DB?

It happens because php runs as a module under apache and each persistent
connection is associated with an apache child / php pair.

To prevent this problem, the sum of all maximum apache children for all
web servers hitting a given database HAS to be lower than the max
connections setting for postgresql or you will eventually, under load and
at the worst possible time, experience connection starvation and have dead
pages loading.  It's an easy configuration change to make.  But it wasn't
made on the postgresql.org boxen apparently before now.
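
The rule above is simple arithmetic, sketched here in Python. The function
and parameter names are made up for the example. The
`connection_strings_per_child` multiplier is an added assumption: a child
can end up holding one persistent connection per distinct connection string
it has used, which is the "different database" case mentioned earlier in
the thread.

```python
def spare_backend_slots(max_children_per_server, max_connections,
                        connection_strings_per_child=1, reserved=0):
    """Return backend slots left over in the worst case.

    max_children_per_server: max apache children for each web server
        that talks to this database server.
    max_connections: the postgresql max connections setting.
    connection_strings_per_child: assumed number of distinct persistent
        connections a single child may hold open.
    reserved: slots set aside for admin or cron use (hypothetical).

    A positive result satisfies the rule; zero or negative means
    connection starvation is possible under load.
    """
    worst_case = sum(max_children_per_server) * connection_strings_per_child
    return max_connections - reserved - worst_case


# The figures suggested earlier in the thread: 64 children, 128 connections.
print(spare_backend_slots([64], 128))  # 64
```

With three distinct connection strings per child, the same 64 children
could want 192 backends, so the check would come out negative even though
64 is below 128.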


Re: Website troubles

From
Rogier van Eeten
Date:
On Wed, Jan 29, 2003 at 03:48:18PM -0500, Tony Grant wrote:
> On Wed, 2003-01-29 at 09:19, Greg Sabino Mullane wrote:
>
> > Was it ever resolved exactly what happened to the website last
> > weekend?
>
> Read a little more news my friend!
>
> The whole internet was paralysed by a worm trying to bring down all the
> SQL servers of the earth.

Uhm... wasn't that a mssql-worm. And the patch was out for about half a
year. So any administrator with a broken mssql wasn't quite good in his
job. And I sincerely hope that the postgresql mailinglist wasn't running
from a machine with mssql...


Rogier

Re: Website troubles

From
Justin Clift
Date:
Rogier van Eeten wrote:
<snip>
> Uhm... wasn't that a mssql-worm. And the patch was out for about half a
> year. So any administrator with a broken mssql wasn't quite good in his
> job. And I sincerely hope that the postgresql mailinglist wasn't running
> from a machine with mssql...

Hi Rogier,

On the subject of that worm, apparently it didn't affect just MS SQL
Server, but also most (all?) products containing Microsoft Database
Embedded.

Analysis: SQL slammer
http://www.robertgraham.com/journal/030126-sqlslammer.html

"Most victims were infected through MSDE 2000, a lightweight version of
SQL Server installed as part of many applications from Microsoft (e.g.
Visio) as well as 3rd parties. You might have MSDE on your desktop right
now."

"The problem had little to do with normal SQL Server 2000 installations."

That includes:

Microsoft Visio
Veritas Backup Exec 9.0
McAfee Antivirus (Ha!)

For a "News Story" type of thing:

Worm may not hit Microsoft alone
http://www.msnbc.com/news/866469.asp?cp1=1

Hope that helps.

:-)

Regards and best wishes,

Justin Clift


> Rogier


--
"My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there."
- Indira Gandhi