Thread: Total crash of my db-server

Total crash of my db-server

From
"Henrik Steffen"
Date:
Hello all,

sometimes I experience a total crash of my
db-server, e.g. while doing automated maintenance tasks:

At 2:30 am every night the webserver is shut
down, so there won't be any concurrent access to the
db-server. Then the script runs a
VACUUM FULL

This is what happened tonight while fully vacuuming:

server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

Then, the script selects all user tables and starts
reindexing them. Tonight, reindexing of the first table
started and seconds later the whole server crashed.
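
For readers following along, the nightly sequence described above (one VACUUM FULL, then a REINDEX per user table) can be sketched roughly as below. This is only an illustration: the helper names and the example table names are hypothetical, and the original script was not posted.

```python
def quote_ident(name: str) -> str:
    """Double-quote an SQL identifier, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def maintenance_statements(user_tables):
    """Return the nightly statements in the order described in the post."""
    statements = ["VACUUM FULL;"]
    # One REINDEX per user table, issued only after the vacuum statement.
    statements += [f"REINDEX TABLE {quote_ident(t)};" for t in user_tables]
    return statements


if __name__ == "__main__":
    # Hypothetical table names, for illustration only.
    for stmt in maintenance_statements(["customers", "orders"]):
        print(stmt)
```

Generating the statements separately from running them makes it easy to log which statement was in flight when the machine stopped responding.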

No ping, nothing else possible....

This is the list of recent crashes:
Tonight 02:42 am
Yesterday night 02:39 am
Tuesday at 10:34 am
Last saturday at 10:44 am
Last Tuesday at 02:19 am
The saturday before at 04:01 am
The thursday before at 04:02 am
the tuesday before at 02:25 am

Always complete crashes... only reset helped.

Most crashes occur during maintenance tasks.
However, there are some other crashes, too.
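
As a quick check of that claim, the crash times listed above can be tallied by time of day. The sketch below assumes, purely for illustration, that anything between 02:00 and 04:59 falls into the nightly maintenance window; the post only states that maintenance starts at 2:30 am.

```python
from collections import Counter

# Crash times exactly as listed in the post (24-hour clock).
CRASH_TIMES = ["02:42", "02:39", "10:34", "10:44",
               "02:19", "04:01", "04:02", "02:25"]


def bucket(hhmm: str) -> str:
    """Label a time as 'night' (02:00-04:59, an assumed window) or 'day'."""
    hour = int(hhmm.split(":")[0])
    return "night" if 2 <= hour <= 4 else "day"


def tally(times):
    """Count how many crash times fall into each bucket."""
    return Counter(bucket(t) for t in times)


if __name__ == "__main__":
    print(dict(tally(CRASH_TIMES)))
```

Under that assumption, six of the eight listed crashes fall into the night window and two happened during the day.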

There are never any hints in /var/log/messages

I upgraded to postgresql 7.3 recently, but it doesn't
seem to help either.

I am almost desperate.

We are running some mysql-servers here, too, and more
and more often I catch myself thinking about moving my whole
system to a mysql-server... my colleagues have NEVER
had such trouble with their mysql-servers....

Do you have any hints for me? What can I do? My last
resort would be to move to mysql, but I am almost
desperate....

thanks for your help

--

With kind regards

Henrik Steffen
Managing Director

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com          Tel. +49 4141 991230
mail: steffen@topconcepts.com       Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline:  +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
Your SMS gateway: NEW at: http://sms.city-map.de
System partners wanted: http://www.franchise.city-map.de
--------------------------------------------------------
Commercial register: AG Stade HRB 5811 - VAT ID: DE 213645563
--------------------------------------------------------


Re: Total crash of my db-server

From
Justin Clift
Date:
Hi Henrik,

This *really* sounds like you have a system wide problem, not just a PostgreSQL problem.

Can't imagine how moving to MySQL will help with that.  ;-)

What Operating System are you using, and when was the last time you patched/updated it with the vendor recommended patches?

Regards and best wishes,

Justin Clift




--
"My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there."
- Indira Gandhi


Re: Total crash of my db-server

From
Ian Barwick
Date:
On Sunday 15 December 2002 15:16, Justin Clift wrote:
> Hi Henrik,
>
> This *really* sounds like you have a system wide problem, not just a
> PostgreSQL problem.
>
> Can't imagine how moving to MySQL will help with that.  ;-)
>
> What Operating System are you using, and when was the last time you
> patched/updated it with the vendor recommended patches?

Additionally, have you considered the possibility of a hardware
problem? I had a fileserver once which worked perfectly in "normal"
service, but died regularly and inexplicably whenever large amounts
of data were transferred over the network to the backup machine.
Turned out to be a motherboard problem, possibly in combination
with some of the other components, because we were never able
to reproduce the problem outside of that particular machine...


Ian Barwick
barwick@gmx.net


Re: Total crash of my db-server

From
Tom Lane
Date:
>> This *really* sounds like you have a system wide problem, not just a
>> PostgreSQL problem.
>>
>> Can't imagine how moving to MySQL will help with that.  ;-)

Actually, moving to MySQL will make it worse.  We can say with
confidence that a system lockup is not Postgres' fault because Postgres
does not (and will not) run as root.  I'm not sure whether MySQL *must*
be root, but that seems to be a pretty common way of setting it up ...
and when you do that, you can't entirely exclude it from consideration
when you're looking at problems that would require root privileges to
cause.

> Additionally, have you considered the possibility of a hardware
> problem?

I tend to agree with Ian on that --- it sounds more like flaky hardware
than anything else.  Time for memtest86 and some disk testing too.

            regards, tom lane

Re: Total crash of my db-server

From
Lee Harr
Date:
In article <00d601c2a443$7b7b7dd0$7100a8c0@henrik>, "Henrik Steffen" wrote:
>

> sometimes I experience a total crash of my
> db-server, e.g. while doing automated maintenance tasks:
>

The computer crashes or just the database?
It is not clear from your description.

> Always complete crashes... only reset helped.
>

reset postgres? or are you resetting the computer?


> Most crashes occur during maintenance tasks.
> However, there are some other crashes, too.
>

Is there any commonality between the crashes? Do the
others maybe happen during daily/weekly OS reporting
(generally, during heavy disk activity)?


> We are running some mysql-servers here, too, and I
> more and more often try to imagine to move my whole
> system to a mysql-server... my colleagues NEVER have
> had such trouble with their mysql-servers yet....
>
> Do you have any hints for me? What can I do? My last
> choice would be to move to mysql, but I am almost
> desperate....
>


You are running mysql on the same machine? Or are these
separate systems running mysql?

My first reaction is "hardware trouble" but without
more specifics it is tough to make a diagnosis. If
you have a spare box, that might be a quick way to
see if the problem is hardware related.



Re: Total crash of my db-server

From
Kevin Brown
Date:
Henrik Steffen wrote:
>
> Hello all,
>
> sometimes I experience a total crash of my
> db-server, e.g. while doing automated maintenance tasks:

[...]

> Then, the script selects all user tables and starts
> reindexing them. Tonight, reindexing of the first table
> started and seconds later the whole server crashed.
>
> No ping, nothing else possible....

If you can't ping the system then it means that the operating system
itself has stopped working properly (the networking stack is managed
solely by the operating system).

That means that you've either managed to tickle a bug in the operating
system itself or you have a hardware problem.

You didn't mention what OS you're running under but it's more likely
that you have a hardware problem than an OS bug.

Moving to MySQL won't help you here, I'm afraid.  Only fixing your
hardware will.
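
The reasoning above can be turned into a small triage check: whether the host answers at all is a separate question from whether the database accepts connections. The following is a minimal sketch under stated assumptions. The function names are hypothetical, 5432 is PostgreSQL's default port, and a failed ping can also be caused by the network path or a firewall, which the classification text allows for.

```python
import socket


def port_open(host: str, port: int = 5432, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def classify(ping_ok: bool, db_port_ok: bool) -> str:
    """Map two observations onto the layer that most likely failed."""
    if not ping_ok:
        # No reply at all: the kernel, the hardware or the network path is
        # not responding, which is below the level of any database process.
        return "host unreachable: suspect OS, hardware or network"
    if not db_port_ok:
        return "host alive, database not accepting connections"
    return "host and database both reachable"
```

In the situation described in this thread (no ping, only a reset helps), the first branch applies, which is why the replies point at the OS or the hardware rather than at the database.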


If this is a system that you depend on for production, I recommend
that you use ECC memory if at all possible.  At least then you won't
have to worry nearly as much about the possibility of bad RAM silently
causing errors...


--
Kevin Brown                          kevin@sysexperts.com

Re: Total crash of my db-server

From
"Henrik Steffen"
Date:
Dear Justin,

I am not sure whether it's really a hardware problem,
because I have had similar problems with different machines
and different os- and pgsql-versions before... If you
browse the archive you will find postings from me about
crashes and problems over the last 2-3 years...

I can only say that the mysql-servers we are running
have never had similar trouble - and they run on identical
hardware and os-types under almost identical load.

Currently, I am running postgres 7.3 on a Redhat Linux
(Kernel 2.4.19). Most important software packages are
always up2date.





Re: Total crash of my db-server

From
"Henrik Steffen"
Date:
yes, I have thought about it...

I am not sure if it's a hardware problem.

We upgraded to ECC-RAM recently and hoped it would
help, but it didn't.

It's a hardware raid 1 system (mirroring) on IDE
hard drives.



Re: Total crash of my db-server

From
"Henrik Steffen"
Date:
hi tom,

ok, I understand this.

But: There is ONLY postgres running on this particular
machine. And it's mostly when backup (dumpall) and/or
vacuuming/reindexing is going on.

In my opinion, postgresql does something on my machine
that leads to these complete system lockups.




Re: Total crash of my db-server

From
"Henrik Steffen"
Date:
the whole computer crashes.

it's mostly during dumpalls (backup) and/or vacuuming
or reindexing...



Re: Total crash of my db-server

From
Justin Clift
Date:
Henrik Steffen wrote:
> hi tom,
>
> ok, I understand this.
>
> But: There is ONLY postgres running on this particular
> machine. And it's mostly when backup (dumpall) and/or
> vacuuming/reindexing is going on.
>
> In my opinion, postgresql does something on my machine
> that leads to these complete system lockups.

It sounds like the system lockups are perhaps occurring due to disk I/O, with PostgreSQL being the program that pushes the disk load past what the system can handle.

How much load does this system normally have when there aren't dumps/vacuums/reindexes going on?  I'm trying to understand how much load your system normally copes with before locking up.

As a thought, if this is really being caused by disk I/O load, then it might be possible to trigger it on demand with disk benchmarking programs (just an idle thought).  That could be useful to know about.
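
To illustrate the idea of generating disk load on demand, here is a minimal sketch, not a real benchmarking program: it writes random data to a temporary file, forces it to disk, reads it back and compares checksums. The function name and the default sizes are assumptions chosen for illustration.

```python
import hashlib
import os
import tempfile


def disk_roundtrip(directory: str, total_mb: int = 64, chunk_mb: int = 1) -> bool:
    """Write random data, fsync it, read it back and compare SHA-256 digests."""
    chunk_size = chunk_mb * 1024 * 1024
    written = hashlib.sha256()
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            for _ in range(total_mb // chunk_mb):
                block = os.urandom(chunk_size)
                written.update(block)
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to the device
        read_back = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk_size), b""):
                read_back.update(block)
        return written.digest() == read_back.digest()
    finally:
        os.remove(path)
```

Note that the read-back is likely to be served from the OS page cache, so this mostly exercises the write path. On a production machine it would need to be run in a maintenance window, with a size much larger than RAM if the reads are supposed to hit the disks.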

Regards and best wishes,

Justin Clift



Re: Total crash of my db-server

From
"Jan Weerts"
Date:

Hi Steffen!

>the whole computer crashes.
>
>it's mostly during dumpalls (backup) and/or vacuuming
>or reindexing...

From my experience with two different machines: we had this behaviour on two servers under Linux. Both of them were running postgres, but database load did not necessarily coincide with a dead system.

After some problems we found out that both cases could be solved with different RAM configurations. The first machine had been in use for two years and suddenly started to reboot during the day (not after hours). We suspected an attack or broken hard drives. In the end we changed the RAM, and since then it has been happily humming in its rack.

The second machine was brand-new, and we wanted to put one gig of RAM in two DIMM sockets. The machine was set up and postgres installed. When we started to test the database and load the system, we got kernel panics or a totally unresponsive machine. In the end, after a lot of testing, we removed one of the RAM modules, and since then it has been running with just half a gig (which suffices for the application we will be using it for). Different software-based RAM tests showed varying results on each run, not reproducible. We suspect the chipset to be broken in this respect, despite its claimed ability to use these modules.

So my guess here is that, since postgres is not running as root, it cannot really "destroy" the kernel or anything vital. For this kind of breakdown I usually blame Windows, but since this is Linux, I really do suspect the hardware, even if this is not the first time you have experienced it, as you said in another post. Are the other machines loaded (CPU and RAM) by other applications or only by postgres? If only postgres, try some other RAM- and CPU-consuming app and load the machine heavily.

HTH
  Jan
p.s.: we once had a temp who we suspect zapped two RAM
modules and two mainboards in just one month. And since the
first case shows that RAM does age, I am prepared to blame hardware
in some cases.
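
As a rough illustration of the "load the machine with something other than postgres" suggestion, the sketch below fills a buffer with alternating bit patterns and counts bytes that do not read back correctly. It is a user-space toy with a hypothetical function name and arbitrary default sizes. It only touches the memory the OS happens to hand out and is no substitute for a dedicated tester such as memtest86, which was recommended earlier in this thread.

```python
def memory_pattern_check(size_mb: int = 64, passes: int = 2) -> int:
    """Fill a buffer with alternating bit patterns; return the mismatch count."""
    size = size_mb * 1024 * 1024
    buf = bytearray(size)
    mismatches = 0
    for _ in range(passes):
        for pattern in (0x55, 0xAA):
            buf[:] = bytes([pattern]) * size         # write the pattern everywhere
            mismatches += size - buf.count(pattern)  # bytes that no longer hold it
    return mismatches
```

On healthy hardware the result should be 0. A non-zero result, or a machine that locks up while this runs with a large `size_mb`, would point away from the database and towards RAM, chipset or cooling.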

Re: Total crash of my db-server

From
"Henrik Steffen"
Date:
Hi Justin,

average load is usually somewhere around 0.5;
under higher load it is sometimes 3.0 or even up to 7.0

it's a dedicated postgresql-machine. all accesses are made
by a webserver in the same subnet. There are about 15,000
daily users. Each request to the webserver triggers one or
more accesses to the database (using persistent connections,
mod_perl, squid as a proxy, etc.)

The webserver is set to MaxClients == 40 ... as far as I can
tell, this limit has never been reached. So there should
never be more than 40 concurrent postgresql-processes.

When dumpall or reindexing / vacuum full is run at night,
the webserver is shut down first.

disk benchmarking programs would perhaps be interesting
(which one do you suggest?)... but note: it's a production
server, and I have already had too much downtime this
month...



Re: Total crash of my db-server

From
Shridhar Daithankar
Date:
On Monday 16 December 2002 07:18 pm, you wrote:
> disk  benchmarking programs would perhaps be interesting
> (which one do you suggest?)... but note: it's a production
> server, and I have already had too much downtime this
> month...

I suggest you run pgbench with 10M records/100,000 transactions/100 users. If
it is a hardware error, the machine should go belly up under that load.

I guess this test should take roughly 2GB of disk space. Just FYI..
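
The sizing above can be translated into pgbench arguments with a little arithmetic. The sketch below assumes that "100,000 transactions" means the total across all clients, and that the database name `bench` is a placeholder. It relies on pgbench creating 100,000 rows in its accounts table per scale unit (`-s`) and on `-t` counting transactions per client.

```python
ROWS_PER_SCALE_UNIT = 100_000  # accounts rows pgbench creates per scale unit


def pgbench_commands(rows: int, total_transactions: int, clients: int, dbname: str):
    """Return (init_cmd, run_cmd) argument lists for the requested sizing."""
    scale = -(-rows // ROWS_PER_SCALE_UNIT)         # ceiling division
    per_client = -(-total_transactions // clients)  # -t is per client
    init_cmd = ["pgbench", "-i", "-s", str(scale), dbname]
    run_cmd = ["pgbench", "-c", str(clients), "-t", str(per_client), dbname]
    return init_cmd, run_cmd


if __name__ == "__main__":
    # "bench" is a hypothetical database name.
    init_cmd, run_cmd = pgbench_commands(10_000_000, 100_000, 100, "bench")
    print(" ".join(init_cmd))
    print(" ".join(run_cmd))
```

With the numbers from the post this works out to a scale factor of 100 and 1,000 transactions for each of the 100 clients. If 100,000 transactions per client were meant instead, only the `-t` value would change.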

HTH

 Shridhar


Re: Total crash of my db-server

From
Tino Wildenhain
Date:
Hi Henrik,

--On Monday, 16 December 2002 13:45 +0100 Henrik Steffen
<steffen@city-map.de> wrote:

> hi tom,
>
> ok, I understand this.
>
> But: There is ONLY postgres running on this particular
> machine. And it's mostly when backup (dumpall) and/or
> vacuuming/reindexing is going on.
>
> In my opinion, postgresql does something on my machine
> that leads to these complete system lockups.

When you drive along a road and fall into a big hole, is
it your car's fault?


SCNR ;)

Regards
Tino

Re: Total crash of my db-server

From
Thomas Beutin
Date:
Hi,

On Mon, Dec 16, 2002 at 01:45:07PM +0100, Henrik Steffen wrote:
> But: There is ONLY postgres running on this particular
> machine. And it's mostly when backup (dumpall) and/or
> vacuuming/reindexing is going on.
>
> In my opinion, postgresql does something on my machine
> that leads to these complete system lockups.
Maybe the problem is related to the old sig11 problem:
http://www.bitwizard.nl/sig11/

Greetings,
-tb

--
Thomas Beutin                             tb@laokoon.IN-Berlin.DE
Beam me up, Scotty. There is no intelligent life down in Redmond.

Re: Total crash of my db-server

From
Tino Wildenhain
Date:
Hi Henrik,

--On Monday, 16 December 2002 13:40 +0100 Henrik Steffen
<steffen@city-map.de> wrote:

> Dear Justin,
>
> I am not sure whether it's really a hardware problem,
> because I have had similar problems with different machines
> and different os- and pgsql-versions before... If you
> browse the archive you will find postings from me about
> crashes and problems over the last 2-3 years...
>
> I can only say that the mysql-servers we are running
> have never had similar trouble - and they run on identical
> hardware and os-types under almost identical load.
>
> Currently, I am running postgres 7.3 on a Redhat Linux
> (Kernel 2.4.19). Most important software packages are
> always up2date.

The situation is that there are many, many people out there who use
this RDBMS with big or even very large databases. In our case we
are at about 18 gig.

If the DB crashed (which it does not in our case), I might
blame the DB software. If the OS crashes, I'd
certainly blame the OS or the hardware. Whatever the software
does, it cannot crash the system unless it's running in
kernel space. Postgresql is not a driver that accesses hardware.

It might be that postgresql triggers problematic details of
your setup (it uses large memory areas, depends on task switching
and signal handling), but even then it is the setup that is problematic,
not postgresql.

Regards
Tino

Re: Total crash of my db-server

From
James Thompson
Date:
> As a thought, if this is really being caused by disk I/O loads, then
> it might be able to trigger it on demand with disk benchmarking
> programs (just an idle thought).  That could be useful to know about.

Sorry if this was mentioned previously, I didn't catch the start of this
thread.

I had a server that locked up at about the same time every day.  It wound up
being a weak CPU cooling fan causing gradual overheating.  No clue why
the thermal protection wasn't kicking in.  I replaced the fan and all my
issues went away.  Drove me nuts for about a month. :)

Take Care

->->->->->->->->->->->->->->->->->->---<-<-<-<-<-<-<-<-<-<-<-<-<-<-<-<-<-<
James Thompson    138 Cardwell Hall  Manhattan, Ks   66506    785-532-0561
Kansas State University                          Department of Mathematics
->->->->->->->->->->->->->->->->->->---<-<-<-<-<-<-<-<-<-<-<-<-<-<-<-<-<-<




Re: Total crash of my db-server

From
"Nigel J. Andrews"
Date:
On Mon, 16 Dec 2002, Justin Clift wrote:

> Henrik Steffen wrote:
> > hi tom,
> >
> > ok, I understand this.
> >
> > But: There is ONLY postgres running on this particular
> > machine. And it's mostly when backup (dumpall) and/or
> > vacuuming/reindexing is going on.
> >
> > In my opinion, postgresql does something on my machine
> > that leads to these complete system lockups.
>
> It sounds like the system lockups are occuring perhaps due to disk I/O,
> with PostgreSQL being the program causing the disk load past what the
> system handles.
>
>  [edited]
> As a thought, if this is really being caused by disk I/O loads, then it
> might be able to trigger it on demand with disk benchmarking programs
> (just an idle thought).  That could be useful to know about.
>

I'm coming into this late, don't know what's been said before in this thread
and considering the above mention of dumping I'm probably completely off the
charts on the uselessness of this question/suggestion but...

Are you using a 'lazy' memory allocation setup? Suddenly discovering that
requested memory isn't really there, after being told it was when
requesting it, can have nasty effects.
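
For what it's worth, on Linux the overcommit policy can be read from
/proc/sys/vm/overcommit_memory. Below is a minimal sketch for checking it,
assuming that file exists on the kernel in question and that the mode
numbers match the documentation of current kernels (an older kernel such as
2.4 may not support all of them). The function names are made up for
illustration.

```python
from pathlib import Path

# Mode meanings as documented for current Linux kernels (assumption: the
# kernel under test uses the same numbering).
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (the usual default)",
    1: "always overcommit",
    2: "strict accounting, no overcommit beyond the configured limit",
}


def describe_overcommit(raw):
    """Translate the raw contents of overcommit_memory into a description."""
    mode = int(raw.strip())
    return OVERCOMMIT_MODES.get(mode, f"unknown mode {mode}")


def read_overcommit(path="/proc/sys/vm/overcommit_memory"):
    """Return a description of the active policy, or None if the file is absent."""
    proc_file = Path(path)
    if not proc_file.exists():
        return None
    return describe_overcommit(proc_file.read_text())
```

Calling read_overcommit() on the database host shows whether allocations
are being granted 'lazily' (modes 0 and 1) or accounted for strictly (mode 2).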

I presume the normal talk of core dumps etc has happened.


--
Nigel Andrews



Re: Total crash of my db-server

From
Tom Lane
Date:
"Henrik Steffen" <steffen@city-map.de> writes:
> In my opinion, postgresql does something on my machine
> that leads to these complete system lockups.

Once again: postgres is an unprivileged application.  It can *not* lock
up the machine that way.  You're dealing with either a hardware fault
or a kernel bug --- evidently one that only appears under heavy load,
but that doesn't make it postgres' fault.

I'd suggest asking some kernel hackers for debugging help.

            regards, tom lane

Re: Total crash of my db-server

From
"scott.marlowe"
Date:
Henrik, have you tested the memory and drive subsystems on this machine?

All this sounds very much like how an old server of mine was behaving when
I had bad memory.  No database can be expected to perform reliably on
unreliable hardware.

Look for memtest86 if you're on Intel hardware.  Look at badblocks on
Linux, or the equivalent on whatever OS you're on, for mapping out bad
drive blocks.

If your machine is dying with NO ping response, it has serious problems,
and postgresql is just revealing them.

Good luck on troubleshooting this problem.



Re: Total crash of my db-server

From
Brian Hirt
Date:
henrik,

i had the exact same problem as well earlier this year on a dual xeon.
the problem ended up being memory.  even though we had registered ecc
memory, that didn't make any difference.

--brian

--
Brian Hirt <bhirt@mobygames.com>


Re: Total crash of my db-server

From
Kevin Brown
Date:
Nigel J. Andrews wrote:
> Are you using a 'lazy' memory allocation setup. You could find that suddenly
> finding the requested memory isn't really there when told it was when
> requesting it has nasty effects.

But this wouldn't cause the kernel to crash.  The kernel might start
killing processes, possibly randomly, in an effort to free memory (and
others would die of their own accord as their attempts to allocate
memory fail), but it shouldn't cause the kernel itself to hang or
crash.


--
Kevin Brown                          kevin@sysexperts.com

Re: Total crash of my db-server

From
Kevin Brown
Date:
Henrik Steffen wrote:
> In my opinion, postgresql does something on my machine
> that leads to these complete system lockups.

PostgreSQL might beat on the disk subsystem hard enough to show faults
in it, or perhaps it uses enough CPU that the CPU isn't being cooled
properly anymore, etc.

But regardless, that only means that PostgreSQL is a trigger, not an
actual root cause.  And it means that you will almost certainly have
problems even after switching database engines.

You mentioned that you're using a hardware RAID controller.  There is
always the possibility that the driver for that controller isn't
entirely stable.

If you have an identical box you can drop in place, I highly recommend
that you do so.  I'm betting that your problems will disappear after
you do that.


--
Kevin Brown                          kevin@sysexperts.com

Re: Total crash of my db-server

From
Terry Fielder
Date:
I concur with this; I had *exactly* this problem.  My hardware vendor
overclocked my Intel CPU, which was fine while it was an NT box because NT
thrashes on the disk.

But when running Postgres on Linux on that machine (we had to put more
hardware behind NT), the hardware test utilities all showed good hardware,
yet there were random bit errors that went away when I removed the
overclocking.  NT never encountered that because it was choking on disk I/O,
not on CPU cycles.

Terry Fielder
Manager Software Development and Deployment
Great Gulf Homes / Ashton Woods Homes
terry@greatgulfhomes.com





Server testing.

From
"scott.marlowe"
Date:
This recent thread about a server crashing got me to thinking of server
acceptance testing.

When you are faced with the daunting task of testing a server, you should
be trying to break it.  Honestly, the most common mistake I see is
folks ordering a new server and simply assuming there are no problems
with it.  Assume all hardware is bad until you've proven to yourself
otherwise.  Know at what point your hardware will be brought to its knees
(or worse) before your users can do that to you.

Here are a few good tests for bad hardware that I've found; if anyone else
has any, please chip in.  Note that not all failures are deterministic and
repeatable.  Some show up very seldom, or only when the server room is
above 70 degrees.  It's easy to know when you've got a big problem with
your hardware, but often hard to see the little ones.

The first thing I test with is compiling the Linux kernel AND / OR
compiling PostgreSQL.  Both are complex projects that stress the system
fairly well.  Toss in a '-j 8' setting and watch the machine chew up
memory and CPU time.

It's easy to write a script that basically does a make clean; make over
several iterations and stores the md5sum of the make output each time.  They
should all be the same.  Set the box up to compile the Linux kernel 1000
times over the weekend.  Check the md5s and see if any of them differ.
I've seen boxes with bad memory compile the Linux kernel 10 or 20 times
before generating an error.  Most of the time a bad memory module is
obvious, sometimes not.
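
A rough sketch of such a script, in Python rather than shell.  Hashing the
object files (the "*.o" pattern) is my assumption about what to compare;
some builds embed timestamps or build counters in their final images, so
pick a pattern whose files should be bit-identical from run to run.  The
function names are made up for illustration.

```python
import hashlib
import subprocess
from collections import Counter
from pathlib import Path


def digest_tree(root, pattern="*.o"):
    """Return one MD5 hex digest covering every file under root matching pattern."""
    md5 = hashlib.md5()
    for path in sorted(Path(root).rglob(pattern)):
        if path.is_file():
            # Include the relative name so renamed or missing files change the digest.
            md5.update(str(path.relative_to(root)).encode())
            md5.update(path.read_bytes())
    return md5.hexdigest()


def repeat_build(build_cmd, clean_cmd, root, runs, pattern="*.o"):
    """Run clean + build `runs` times in root and count each distinct digest."""
    digests = Counter()
    for _ in range(runs):
        # check=True raises CalledProcessError if a step fails, which on a
        # previously working tree is itself a sign of trouble.
        subprocess.run(clean_cmd, cwd=root, check=True)
        subprocess.run(build_cmd, cwd=root, check=True)
        digests[digest_tree(root, pattern)] += 1
    return digests
```

Something like repeat_build(["make", "-j", "8"], ["make", "clean"],
"/usr/src/linux", runs=1000) would do the weekend run (the path is only an
example); more than one key in the returned counter means at least one
build came out different.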

memtest86 is pretty good.  It, too, can miss a bad memory location if the
memory is right MOST of the time but sometimes flakes out on you, so
you may need to run it multiple times.

Copy HUGE files across your drive arrays, and md5sum them at the beginning
and end.  The md5sums should always match; if they don't match, even just
once out of hundreds of copies, your machine has a problem.
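
A minimal sketch of that copy test in Python.  The function names are
made up for illustration.  One caveat: the OS may serve the read-back
from its page cache, so the file should be larger than RAM if you want the
comparison to really exercise the disks.

```python
import hashlib
import shutil
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def copy_and_verify(source, dest_dir, copies):
    """Copy source into dest_dir `copies` times; return the copies whose MD5 differs."""
    expected = md5_of(source)
    mismatches = []
    for i in range(copies):
        target = Path(dest_dir) / f"copy_{i}"
        shutil.copyfile(source, target)
        if md5_of(target) != expected:
            # Keep the bad copy around so it can be inspected afterwards.
            mismatches.append(target)
        else:
            target.unlink()
    return mismatches
```

An empty list after hundreds of copies is what you want; anything else
means the box has a problem somewhere between memory, controller and disk.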

Make sure your machine can operate reliably at the temperatures it may have
to experience.  I've seen plenty of servers that run fine in a nice cold
room (say 60 degrees F or less) but fail when the temp rises 5 or 10
degrees.  A server that fails at 72 degrees F consistently is too heat
sensitive to be reliable over the long haul.  Remember that collected dust
and age make electronics more susceptible to heat failure, so a
new server that fails at 72 might fail at 70 next year, and 68 the year
after that.

I know I'm missing lots, so feel free to join in.

The two most important concepts for server acceptance testing:

1:  Assume it is broken.
2:  Try to prove it is broken.

That way, when it DOES work, you'll be pleasantly surprised, which is way
better than assuming it works and finding out during production that your
new server has issues.

An aside:  Many newer users get upset when they get told they must have
bad hardware, because PostgreSQL just doesn't act like that.  But it's
true, PostgreSQL doesn't just act flaky.

This reminds me of my favorite saying:  "When you hear hoofbeats, don't
think Zebra!"  Loosely translated, when your PostgreSQL box starts acting
up, don't think it's PostgreSQL's "fault", because it almost never is.


Re: Server testing.

From
Kenneth Godee
Date:
I also believe that when buying servers, you should spend the extra money
and buy quality servers. Our new Compaq DL380 G2 has redundant everything....
mem,cpu,bios,fans,controllers,drives,nics. It costs a little (lot) extra, but
for me it has ALWAYS paid off in the long run.

What kind of server is this that keeps crashing?
Did I read this thread right earlier, this system has RAID 1 "IDE" drives?
Must be a new direction in server class machines?

Just the other night I wrote a bad SQL statement that was interesting
in that it would blow up postgres! It would chew CPU @ 100%, then slowly chew
up all available memory, then move on to chew up all available swap space,
and finally you would end up with a "killed" process. Hey, what can I say,
I had to run it several more times just to see how postgres, Linux and
the hardware handled the whole thing, but it never locked up the hardware.
I had a couple of processes left over that I had to kill by doing a pg_ctl fast restart,
but that was it.

> The two most important concepts for server acceptance testing:
>
> 1:  Assume it is broken.
> 2:  Try to prove it is broken.
>
> That way, when it DOES work, you'll be pleasantly surprised, which is way
> better than assuming it works and finding out during production that your
> new server has issues.
>

Re: Server testing.

From
Lincoln Yeoh
Date:
I used the cpuburn program too ( http://users.ev1.net/~redelm/ ). It REALLY
heats up the processor - interesting to watch the +5 volt line immediately drop
significantly (using healthd -d). The other voltages also change accordingly.

The test doesn't touch files unlike a kernel recompile, so if you find you
have a flaky system there's a lower chance of a corrupted filesystem.

Plus compiling doesn't put as much load on my CPU - the +5V doesn't drop as
much. I suspect there's not as much FPU access whilst compiling. And the
FPU units are significant power consumers.

Not sure what to use for testing P4s tho (there isn't a cpuburn test
specifically for P4s). I'm using an Athlon XP so I use the burnK7 program.
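
If no cpuburn variant fits your CPU, a crude stand-in is a loop that just
hammers the floating-point unit. This is only a sketch: an interpreted
Python loop will not load the chip anywhere near as hard as cpuburn's
hand-tuned assembly, and the function name is made up for illustration.

```python
import math
import time


def fpu_burn(seconds):
    """Spin on floating-point math for roughly `seconds`; return iterations done."""
    deadline = time.monotonic() + seconds
    iterations = 0
    x = 1.0001
    while time.monotonic() < deadline:
        # Check the clock only every 10,000 operations so that nearly all
        # of the time is spent in the arithmetic itself.
        for _ in range(10_000):
            x = math.sqrt(x * x + 1.0) * 0.5 + math.sin(x)
        iterations += 10_000
    return iterations
```

Run one copy per CPU (for example, one process each calling
fpu_burn(3600)) while watching temperatures and voltages.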

Good luck,
Link.




Re: Server testing.

From
"scott.marlowe"
Date:
On Tue, 17 Dec 2002, Lincoln Yeoh wrote:

> I used the cpuburn program too ( http://users.ev1.net/~redelm/ ). It REALLY
> heats up the processor - interesting to watch the +5 volt immediately drop
> significantly (using healthd -d). The other voltages also change accordingly.
>
> The test doesn't touch files unlike a kernel recompile, so if you find you
> have a flaky system there's a lower chance of a corrupted filesystem.
>
> Plus compiling doesn't put as much load on my CPU - the +5V doesn't drop as
> much. I suspect there's not as much FPU access whilst compiling. And the
> FPU units are significant power consumers.
>
> Not sure what to use for testing P4s tho (there isn't a cpuburn test
> specifically for P4s). I'm using an Athlon XP so I use the burnK7 program.

I've used quake II as a good CPU cooker as well.  any good FPS (First
Person Shooter) usually cranks up the heat on the CPU.  Plus it is fun to
leave Quake on your new dual AMD 2400 MP system in the server room in demo
mode for a week or so to burn it in.

Of course, what we're all saying is how to beat your server like a mule
BEFORE it goes into production. :-)


Re: Server testing.

From
"Shridhar Daithankar"
Date:
On 17 Dec 2002 at 8:40, scott.marlowe wrote:
> Of course, what we're all saying is how to beat your server like a mule
> BEFORE it goes into production. :-)

And before its warranty ends... Rather, it is a learning exercise to play with
a CPU whose warranty is about to end.

Bye
 Shridhar

--
scribline, n.:    The blank area on the back of credit cards where one's signature
goes.        -- "Sniglets", Rich Hall & Friends