Re: Serious Crash last Friday - Mailing list pgsql-general

From Scott Marlowe
Subject Re: Serious Crash last Friday
Msg-id Pine.LNX.4.33.0206201326120.8468-100000@css120.ihs.com
In response to Re: Serious Crash last Friday  ("Henrik Steffen" <steffen@city-map.de>)
Responses Re: Serious Crash last Friday  (Alvaro Herrera <alvherre@atentus.com>)
List pgsql-general
On Thu, 20 Jun 2002, Henrik Steffen wrote:

> Additionally yesterday night there was again a problem with some SELECTs:
>
> NOTICE:  Message from PostgreSQL backend:
> The Postmaster has informed me that some other backend
> died abnormally and possibly corrupted shared memory.
> I have rolled back the current transaction and am
> going to terminate your database system connection and exit.
> Please reconnect to the database system and repeat your query.
> DB-Error in /web/intern.city-map.de/www/vertrieb/wiedervorlage.pl Code 7:
>  server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.

Look at that error message again.  It says SOME OTHER backend died
abnormally, and your query was terminated because of it.  I.e. the
following query was NOT the problem; it simply couldn't run because some
other query caused a backend to abort.

>
> Command was:
> SELECT name
> FROM regionen
> WHERE region='0119';
>  at /web/pm/CityMap/Abfragen.pm line 135
>
> This is really annoying.

Yes it is.

> When I noticed it this morning, I dropped all indexes and recreated them.
> Then I ran a VACUUM FULL VERBOSE ANALYZE - afterwards the same query worked
> properly again.

It would likely have worked fine without all that, since it wasn't the
cause of the backend crash.

> I have now created a cronjob that will drop and recreate all indexes on a
> daily basis.
>
> But shouldn't this be unnecessary ?

Correct.  Someday, someone will step up to the plate and fix the problem
with btrees growing and growing and not reusing dead space.

Until then, the fix is to reindex heavily updated indexes during nightly
maintenance; a sketch of such a job follows.
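A minimal sketch, assuming an index on regionen.region like the one your
query uses (the index name regionen_region_idx is a guess; check
\d regionen for the real one):

    -- regionen_region_idx is an assumed name; substitute your real index
    DROP INDEX regionen_region_idx;
    CREATE INDEX regionen_region_idx ON regionen (region);
    -- refresh planner statistics after the rebuild
    VACUUM ANALYZE regionen;

Depending on your version, REINDEX INDEX regionen_region_idx should do
the same rebuild in a single step.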

A few questions.  Have you done any really heavy testing on your server to
make sure its hardware is sound?  I've seen machines with memory errors or
bad blocks on the hard drive slip into production and wreak havoc due to
slow corruption of a database.

Try compiling the Linux kernel with a -j 10 switch (i.e. 10 separate
compile jobs, which eats up tons of memory) and see if you get sig 11
errors.  Also check your hard drives for bad blocks (badblocks is the
command; its non-destructive read-write mode, -n, saves each block,
write-tests it, and then puts the original data back, so it can find every
bad block on your drives without losing data).

Bad blocks are the primary reason I always try to run my database on at
least software RAID 1 or RAID 5 on Linux, since a bad block will cause the
affected drive to be marked offline rather than compromising your data
integrity.

--
"Force has no place where there is need of skill.", "Haste in every
business brings failures.", "This is the bitterest pain among men, to have
much knowledge but no power." -- Herodotus


