Thread: Total crash of my db-server
Hello all,

sometimes I experience a total crash of my db-server while e.g. doing automated maintenance tasks:

At 2:30 am every night the webserver is shut down, so there won't be any concurrent accesses to the db-server. Then a VACUUM FULL is run. This is what happened tonight while fully vacuuming:

server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

Then the script selects all user tables and starts reindexing them. Tonight, reindexing the first table started, and seconds later the whole server crashed. No ping, nothing else possible....

This is the list of recent crashes:
Tonight 02:42 am
Yesterday night 02:39 am
Tuesday at 10:34 am
Last Saturday at 10:44 am
Last Tuesday at 02:19 am
The Saturday before at 04:01 am
The Thursday before at 04:02 am
The Tuesday before at 02:25 am

Always complete crashes... only a reset helped. Most crashes occur during maintenance tasks; however, there are some other crashes, too. There are never any hints in /var/log/messages.

I upgraded to PostgreSQL 7.3 recently, but it doesn't seem to help either. I am almost desperate.

We are running some MySQL servers here, too, and more and more often I find myself considering moving my whole system to a MySQL server... my colleagues have NEVER had such trouble with their MySQL servers yet....

Do you have any hints for me? What can I do? My last choice would be to move to MySQL, but I am almost desperate....

thanks for your help

--

With kind regards

Henrik Steffen
Managing Director

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com Tel. +49 4141 991230
mail: steffen@topconcepts.com Fax. +49 4141 991233
--------------------------------------------------------
24h Support Hotline: +49 1908 34697 (EUR 1.86/min, topc)
--------------------------------------------------------
Your SMS gateway - now new at: http://sms.city-map.de
System partners wanted: http://www.franchise.city-map.de
--------------------------------------------------------
Commercial register: AG Stade HRB 5811 - VAT ID: DE 213645563
--------------------------------------------------------
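For reference, a nightly job of the kind described above could look roughly like the following. This is a minimal sketch only: the database name, the superuser name, and the webserver init script path are assumptions, not taken from the thread. The work is wrapped in a function so that sourcing the file runs nothing by itself.

```shell
#!/bin/sh
# Sketch of the nightly maintenance described above: stop the webserver,
# run VACUUM FULL, then REINDEX every user table. Names are placeholders.

DB=mydb                       # hypothetical database name
PGUSER=postgres               # hypothetical database superuser

nightly_maintenance() {
    /etc/init.d/httpd stop    # assumed webserver init script

    psql -U "$PGUSER" -d "$DB" -c 'VACUUM FULL;'

    # Select all user tables (skipping the pg_* system catalogs)
    # and reindex them one by one.
    for t in $(psql -U "$PGUSER" -d "$DB" -At \
        -c "SELECT relname FROM pg_class
            WHERE relkind = 'r' AND relname !~ '^pg_';")
    do
        psql -U "$PGUSER" -d "$DB" -c "REINDEX TABLE $t;"
    done

    /etc/init.d/httpd start
}
```

A cron entry such as `30 2 * * * /usr/local/bin/nightly_maintenance.sh` would schedule the run for 2:30 am.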
Hi Henrik,

This *really* sounds like you have a system wide problem, not just a PostgreSQL problem. Can't imagine how moving to MySQL will help with that. ;-)

What Operating System are you using, and when was the last time you patched/updated it with the vendor recommended patches?

Regards and best wishes,

Justin Clift

Henrik Steffen wrote:
> sometimes I experience a total crash of my
> db-server while e.g. doing automated maintenance tasks:
[...]

--
"My grandfather once told me that there are two kinds of people: those who work and those who take the credit. He told me to try to be in the first group; there was less competition there."
- Indira Gandhi
On Sunday 15 December 2002 15:16, Justin Clift wrote:
> This *really* sounds like you have a system wide problem, not just a
> PostgreSQL problem.
>
> Can't imagine how moving to MySQL will help with that. ;-)

Additionally, have you considered the possibility of a hardware problem? I had a fileserver once which worked perfectly in "normal" service, but died regularly and inexplicably whenever large amounts of data were transferred over the network to the backup machine. It turned out to be a motherboard problem, possibly in combination with some of the other components, because we were never able to reproduce the problem outside of that particular machine...

Ian Barwick
barwick@gmx.net
>> This *really* sounds like you have a system wide problem, not just a
>> PostgreSQL problem.
>>
>> Can't imagine how moving to MySQL will help with that. ;-)

Actually, moving to MySQL will make it worse. We can say with confidence that a system lockup is not Postgres' fault because Postgres does not (and will not) run as root. I'm not sure whether MySQL *must* be root, but that seems to be a pretty common way of setting it up ... and when you do that, you can't entirely exclude it from consideration when you're looking at problems that would require root privileges to cause.

> Additionally, have you considered the possibility of a hardware
> problem?

I tend to agree with Ian on that --- it sounds more like flaky hardware than anything else. Time for memtest86 and some disk testing too.

regards, tom lane
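Tom's point about privileges is easy to check on a running system. A small helper sketch follows; nothing in it is specific to this thread, and `postmaster` is simply the conventional process name for the 7.x server:

```shell
# Print the distinct owners of every process whose command name matches
# the first argument. For PostgreSQL this should never include root.
owner_of() {
    ps -eo user,comm | awk -v p="$1" '$2 == p { print $1 }' | sort -u
}

# Example (on the database host):
#   owner_of postmaster    # expect an unprivileged user such as "postgres"
#   owner_of mysqld        # worth checking on the MySQL boxes, too
```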
In article <00d601c2a443$7b7b7dd0$7100a8c0@henrik>, "Henrik Steffen" wrote:
> sometimes I experience a total crash of my
> db-server while e.g. doing automated maintenance tasks:

The computer crashes, or just the database? It is not clear from your description.

> Always complete crashes... only reset helped.

Reset postgres? Or are you resetting the computer?

> Most crashes occur during maintenance tasks.
> However, there are some other crashes, too.

Is there any commonality between crashes? Are the others maybe during daily/weekly OS reporting? (Generally, heavy disk activity.)

> We are running some mysql-servers here, too, and I
> more and more often try to imagine to move my whole
> system to a mysql-server...

You are running mysql on the same machine? Or are these separate systems running mysql?

My first reaction is "hardware trouble", but without more specifics it is tough to make a diagnosis. If you have a spare box, that might be a quick way to see if the problem is hardware related.
Henrik Steffen wrote:
> Then, the script selects all user tables and starts
> reindexing them. Tonight, reindexing the first table
> started and seconds later the whole server crashed.
>
> No ping, nothing else possible....

If you can't ping the system then it means that the operating system itself has stopped working properly (the networking stack is managed solely by the operating system). That means that you've either managed to tickle a bug in the operating system itself, or you have a hardware problem. You didn't mention what OS you're running under, but it's more likely that you have a hardware problem than an OS bug.

Moving to MySQL won't help you here, I'm afraid. Only fixing your hardware will.

If this is a system that you depend on for production, I recommend that you use ECC memory if at all possible. At least then you won't have to worry nearly as much about the possibility of bad RAM silently causing errors...

--
Kevin Brown kevin@sysexperts.com
Dear Justin,

I am not sure whether it's really a hardware problem, because I have had similar problems with different machines and different OS and PostgreSQL versions before... If you browse the archive you will find postings from me about crashes and problems over the last 2-3 years...

I can only tell that the MySQL servers we are running have never had similar trouble - and they run on identical hardware and OS types under almost identical load.

Currently, I am running PostgreSQL 7.3 on Red Hat Linux (kernel 2.4.19). The most important software packages are always up to date.

--
Henrik Steffen

----- Original Message -----
From: "Justin Clift" <justin@postgresql.org>
Sent: Sunday, December 15, 2002 3:16 PM
Subject: Re: [GENERAL] Total crash of my db-server

> What Operating System are you using, and when was the last time you
> patched/updated it with the vendor recommended patches?
[...]
Yes, I have thought about it... I am not sure if it's a hardware problem. We upgraded to ECC RAM recently and hoped it would help, but it didn't. It's a hardware RAID 1 system (mirroring) on IDE hard drives.

--
Henrik Steffen

----- Original Message -----
From: "Ian Barwick" <barwick@gmx.net>
Sent: Sunday, December 15, 2002 4:47 PM
Subject: Re: [GENERAL] Total crash of my db-server

> Additionally, have you considered the possibility of a hardware
> problem?
[...]
Hi Tom,

OK, I understand this. But: there is ONLY postgres running on this particular machine, and it's mostly when the backup (dumpall) and/or vacuuming/reindexing is going on. In my opinion, PostgreSQL does something on my machine that leads to these complete system lockups.

--
Henrik Steffen

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
Sent: Sunday, December 15, 2002 5:29 PM
Subject: Re: [GENERAL] Total crash of my db-server

> Actually, moving to MySQL will make it worse. We can say with
> confidence that a system lockup is not Postgres' fault because Postgres
> does not (and will not) run as root.
[...]
> I tend to agree with Ian on that --- it sounds more like flaky hardware
> than anything else. Time for memtest86 and some disk testing too.
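The memtest86 and disk-testing advice quoted above translates into concrete steps. memtest86 boots from its own floppy/CD image, so it needs scheduled downtime; the disk side can be probed from a running Linux box. A sketch, with the device name as a placeholder for the actual RAID member disks:

```shell
# Hardware checks corresponding to "memtest86 and some disk testing".
# Wrapped in a function so nothing touches the disks until invoked.
check_disks() {
    # Read-only surface scan; safe to run on a mounted filesystem.
    badblocks -sv /dev/hda

    # SMART health report, if smartmontools happens to be installed.
    smartctl -a /dev/hda
}
```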
The whole computer crashes. It's mostly during dumpalls (backup) and/or vacuuming or reindexing...

--
Henrik Steffen

----- Original Message -----
From: "Lee Harr" <missive@frontiernet.net>
Sent: Sunday, December 15, 2002 11:25 PM
Subject: Re: [GENERAL] Total crash of my db-server

> The computer crashes or just the database?
> It is not clear from your description.
[...]
> My first reaction is "hardware trouble" but without
> more specifics it is tough to make a diagnosis. If
> you have a spare box, that might be a quick way to
> see if the problem is hardware related.
Henrik Steffen wrote:
> But: There is ONLY postgres running on this particular
> machine. And it's mostly when backup (dumpall) and/or
> vacuuming/reindexing is going on.
>
> In my opinion, postgresql does something on my machine
> that leads to these complete system lockups.

It sounds like the system lockups are perhaps occurring due to disk I/O, with PostgreSQL being the program pushing the disk load past what the system handles.

How much load does this system normally have when there aren't dumps/vacuums/reindexes going on? I'm trying to understand how much load your system normally copes with before locking up.

As a thought, if this is really being caused by disk I/O load, then it might be possible to trigger it on demand with disk benchmarking programs (just an idle thought). That could be useful to know about.

Regards and best wishes,

Justin Clift
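Provoking the lockup with disk load, as suggested above, doesn't require a full benchmarking suite; even a crude sequential write/read with dd generates sustained I/O. A sketch, with the file location and size chosen arbitrarily (run it off-hours, not against the live service):

```shell
# Generate sustained disk traffic: write a large file, read it back,
# then clean up. Increase count for a longer stress run.
dd if=/dev/zero of=/tmp/iotest bs=1M count=64   # ~64 MB write
dd if=/tmp/iotest of=/dev/null bs=1M            # sequential read-back
sync
rm -f /tmp/iotest
```

Dedicated tools such as bonnie++ also exercise seeks and file-creation/deletion, which a plain sequential dd pass does not.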
Hi Steffen!
>the whole computer crashes.
>
>it's mostly during dumpalls (backup) and/or vacuuming
>or reindexing...
From my experience with two different machines: we had this behaviour on two servers under Linux. Both of them were running postgres, but database load did not necessarily coincide with a dead system.
After some investigation we found out that both cases could be solved with different RAM configurations. The first machine had been in use for two years and suddenly started to reboot during the day (not after hours). We suspected an attack or broken hard drives. In the end we changed the RAM, and since then it has been happily humming in its rack.
The second machine was brand-new and we wanted to put one gig of RAM in two DIMM sockets. The machine was set up and postgres installed. When we started to test the database and load the system, we got kernel panics or a totally unresponsive machine. In the end, after a lot of testing, we removed one of the RAM modules, and since then it has been running with just half a gig (which suffices for the application we will be using it for). Different software-based RAM tests showed varying results on each run, not reproducible. We suspect the chipset to be broken in this respect, despite its claimed ability to use these modules.
So my guess here is that, since postgres is not running as root, it cannot really "destroy" the kernel or anything vital. For this kind of breakdown I usually blame Windows, but since this is Linux, I really do suspect the hardware - even if, as you said in another post, this is not the first time you have experienced it. Are the other machines loaded (CPU and RAM) by other applications, or only postgres? If only postgres, try some other RAM- and CPU-consuming app and load the machine heavily.
HTH
Jan
p.s.: we once had a temp whom we suspect of having zapped two RAM
modules and two mainboards in just one month. And since the
first case proves aging of RAM, I am prepared to blame hardware
in some cases.
Hi Justin,

Average load is usually somewhere around 0.5; under higher load it is sometimes even 3.0, or up to 7.0.

It's a dedicated PostgreSQL machine. All accesses are made by a webserver in the same subnet. There are about 15,000 daily users. Each request to the webserver triggers one or more accesses to the database (using persistent connections, mod_perl, squid as a proxy, etc.). The webserver is set to MaxClients == 40... as far as I can tell this limit has never been reached, so there should never be more than 40 concurrent PostgreSQL processes. When dumpall or reindexing / vacuum full is run at night, the webserver is shut down first.

Disk benchmarking programs would perhaps be interesting (which one do you suggest?)... but note: it's a production server, and I have already had too much downtime this month...

--
Henrik Steffen

----- Original Message -----
From: "Justin Clift" <justin@postgresql.org>
Sent: Monday, December 16, 2002 1:59 PM
Subject: Re: [GENERAL] Total crash of my db-server

> As a thought, if this is really being caused by disk I/O loads, then it
> might be possible to trigger it on demand with disk benchmarking
> programs (just an idle thought). That could be useful to know about.
[...]
On Monday 16 December 2002 07:18 pm, you wrote:
> disk benchmarking programs would perhaps be interesting
> (which one do you suggest?)... but note: it's a production
> server, and I have already had too much downtime this
> month...

I suggest you run pgbench with 10M records / 100,000 transactions / 100 users. If it is a hardware error, it should go belly up under that. I guess it should take roughly 2 GB of space for this test. Just FYI...

HTH

Shridhar
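Those numbers map onto pgbench's options as follows (a sketch; pgbench ships in PostgreSQL's contrib directory, and the database name "bench" is a placeholder). Scale factor 100 creates 100 x 100,000 = 10M rows in pgbench's accounts table, and 100 clients x 1,000 transactions each gives the 100,000 transactions. The commands are wrapped in a function so nothing runs until explicitly invoked on the test server:

```shell
# pgbench stress run as suggested: ~10M rows, 100 concurrent clients,
# 100,000 transactions in total.
run_pgbench() {
    createdb bench
    pgbench -i -s 100 bench        # initialize: scale 100 => ~10M rows, ~2 GB
    pgbench -c 100 -t 1000 bench   # 100 clients x 1000 transactions each
}
```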
Hi Henrik,

--On Monday, 16 December 2002 13:45 +0100 Henrik Steffen <steffen@city-map.de> wrote:
> But: There is ONLY postgres running on this particular
> machine. And it's mostly when backup (dumpall) and/or
> vacuuming/reindexing is going on.
>
> In my opinion, postgresql does something on my machine
> that leads to these complete system lockups.

When you drive on a road and fall into a big hole, is it your car's fault? SCNR ;)

Regards
Tino
Hi,

On Mon, Dec 16, 2002 at 01:45:07PM +0100, Henrik Steffen wrote:
> But: There is ONLY postgres running on this particular
> machine. And it's mostly when backup (dumpall) and/or
> vacuuming/reindexing is going on.
>
> In my opinion, postgresql does something on my machine
> that leads to these complete system lockups.

Maybe the problem is related to the old sig11 problem:
http://www.bitwizard.nl/sig11/

Greetings,
-tb

--
Thomas Beutin tb@laokoon.IN-Berlin.DE
Beam me up, Scotty. There is no intelligent life down in Redmond.
Hi Henrik,

--On Monday, 16 December 2002 13:40 +0100 Henrik Steffen <steffen@city-map.de> wrote:
> I am not sure whether it's really a hardware problem,
> because I have had similar problems with different machines
> and different os- and pgsql-versions before...
>
> Currently, I am running postgres 7.3 on a Redhat Linux
> (Kernel 2.4.19). Most important software packages are
> always up2date.

The situation is, there are many, many people out there who use this RDBMS with big or even large databases. In our case we are at about 18 GB. If the DB crashed (which it does not in our case) I might blame the DB software. If the OS crashes, I'd for sure blame the OS or the hardware. Whatever the software does - it cannot crash the system unless it is running in kernel space. PostgreSQL is not a hardware-accessing driver. It might be that PostgreSQL triggers problematic details in your setup (it uses large memory areas, and depends on task switching and signal handling), but even then the setup is problematic, not PostgreSQL.

Regards
Tino
> As a thought, if this is really being caused by disk I/O loads, then
> it might be possible to trigger it on demand with disk benchmarking
> programs (just an idle thought). That could be useful to know about.

Sorry if this was mentioned previously; I didn't catch the start of this thread. I had a server that locked up at about the same time every day. It wound up being a weak CPU cooling fan causing gradual overheating. No clue why the thermal protection wasn't kicking in. A replacement fan and all my issues went away. Drove me nuts for about a month. :)

Take Care

->->->->->->->->->->->->->->->->->->---<-<-<-<-<-<-<-<-<-<-<-<-<-<-<-<-<-<
James Thompson 138 Cardwell Hall Manhattan, Ks 66506 785-532-0561
Kansas State University Department of Mathematics
->->->->->->->->->->->->->->->->->->---<-<-<-<-<-<-<-<-<-<-<-<-<-<-<-<-<-<
On Mon, 16 Dec 2002, Justin Clift wrote:
> It sounds like the system lockups are occurring perhaps due to disk I/O,
> with PostgreSQL being the program pushing the disk load past what the
> system handles.
[...]

I'm coming into this late, don't know what's been said before in this thread, and considering the above mention of dumping I'm probably completely off the charts on the uselessness of this question/suggestion, but... are you using a 'lazy' memory allocation setup? You could find that discovering the requested memory isn't really there - after being told it was when requesting it - has nasty effects.

I presume the normal talk of core dumps etc. has happened.

--
Nigel Andrews
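The 'lazy' memory allocation question refers to Linux's overcommit behaviour: by default malloc() can succeed for memory the kernel cannot actually back, and the failure only surfaces when the pages are touched. The current policy is visible in /proc; note that the set of accepted values and their exact meanings differ between 2.4 and later kernels, so treat anything beyond reading the value as something to verify against your kernel's own documentation:

```shell
# Show the current overcommit policy; 0 is the default heuristic mode.
cat /proc/sys/vm/overcommit_memory

# The meaning of the other values is kernel-version dependent (2.4
# differs from later kernels); see Documentation/vm/ in your kernel
# source tree before changing it, e.g. as root:
#   echo 0 > /proc/sys/vm/overcommit_memory
```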
"Henrik Steffen" <steffen@city-map.de> writes: > In my opinion, postgresql does something on my machine > that leads to these complete system lockups. Once again: postgres is an unprivileged application. It can *not* lock up the machine that way. You're dealing with either a hardware fault or a kernel bug --- evidently one that only appears under heavy load, but that doesn't make it postgres' fault. I'd suggest asking some kernel hackers for debugging help. regards, tom lane
Henrik, have you tested the memory and drive subsystems on this machine?

All this sounds very much like how an old server of mine was behaving when I had bad memory. No database can be expected to perform reliably on unreliable hardware.

Look for memtest86 if you're on intel hardware. Look at badblocks on linux, or whatever OS you're on, for mapping out bad drive blocks.

If your machine is dying with NO ping response, it has serious problems, and postgresql is just revealing them.

Good luck on troubleshooting this problem.

On Sun, 15 Dec 2002, Henrik Steffen wrote:

> Hello all,
>
> sometimes I experience a total crash of my
> db-server while e.g. doing automated maintainance tasks:
>
> At 2:30 am every night the webserver is shut
> down, so there won't be any concurrent accesses to the
> db-server. then there will be done a
> VACUUM FULL
>
> This is what happened tonight while fully vacuuming:
>
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
>
> Then, the script selects all user tables and starts
> reindexing them. Tonight, reindexeing the first table
> started and seconds later the whole server crashed.
>
> No ping, nothing else possible....
>
> This is the list of recent crashes:
> Tonight 02:42 am
> Yesterday night 02:39 am
> Tuesday at 10:34 am
> Last saturday at 10:44 am
> Last Tuesday at 02:19 am
> The saturday before at 04:01 am
> The thursday before at 04:02 am
> the tuesday before at 02:25 am
>
> Always complete crashes... only reset helped.
>
> Most crashes occur while maintainance tasks.
> However, there are some other crashes, too.
>
> There are never any hints in /var/log/messages
>
> I upgraded to postgresql 7.3 recently, but it doesn't
> seem to help either.
>
> I am almost desperate.
>
> We are running some mysql-servers here, too, and I
> more and more often try to imagine to move my whole
> system to a mysql-server... my collegues NEVER have
> had such trouble with their mysql-servers yet....
>
> Do you have any hints for me? What can I do? My last
> choice would be to move to mysql, but I am almost
> desperate....
>
> thanks for your help
>
> [signature and list footer trimmed]
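[Editor's note] Scott's badblocks/memtest86 advice can be supplemented with a quick write-copy-verify pass over the suspect filesystem. A minimal sketch; the target directory and size are assumptions, so point them at the array you distrust:

```shell
# verify_copy DIR MB: write MB megabytes of pseudo-random data under
# DIR, copy the file, and compare md5 checksums of both copies.
# A mismatch on an otherwise idle box points at disk, controller, or RAM.
verify_copy() {
    dir=${1:-/tmp}; count=${2:-16}
    src="$dir/verify_src.$$"; dst="$dir/verify_dst.$$"
    dd if=/dev/urandom of="$src" bs=1048576 count="$count" 2>/dev/null
    cp "$src" "$dst"
    sync                                # push the data out to the platters
    sum1=$(md5sum "$src" | awk '{print $1}')
    sum2=$(md5sum "$dst" | awk '{print $1}')
    rm -f "$src" "$dst"
    if [ "$sum1" = "$sum2" ]; then
        echo "OK: checksums match"
    else
        echo "FAIL: $sum1 != $sum2 -- suspect disk/controller/RAM" >&2
        return 1
    fi
}

# Point this at the RAID array and loop it overnight, e.g.:
verify_copy /tmp 4
```

Run in a loop for hours; a single mismatch out of hundreds of passes is enough to condemn the hardware, per Scott's advice later in this thread.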
henrik,

I had the exact same problem as well earlier this year on a dual Xeon. The problem ended up being memory. Even though we had registered ECC memory, that didn't make any difference.

--brian

On Mon, 2002-12-16 at 09:23, scott.marlowe wrote:
> Henrik, have you tested the memory and drive subsystems on this machine?
>
> All this sounds very much like how an old server of mine was behaving when
> I had bad memory. No database can be expected to perform reliably on
> unreliable hardware.
>
> Look for memtest86 if you're on intel hardware. Look at badblocks on
> linux, or whatever OS you're on, for mapping out bad drive blocks.
>
> If your machine is dying with NO ping response, it has serious problems,
> and postgresql is just revealing them.
>
> Good luck on troubleshooting this problem.
>
> [Henrik's original report trimmed; quoted in full above]

--
Brian Hirt <bhirt@mobygames.com>
Nigel J. Andrews wrote: > Are you using a 'lazy' memory allocation setup. You could find that suddenly > finding the requested memory isn't really there when told it was when > requesting it has nasty effects. But this wouldn't cause the kernel to crash. The kernel might start killing processes, possibly randomly, in an effort to free memory (and others would die of their own accord as their attempts to allocate memory fail), but it shouldn't cause the kernel itself to hang or crash. -- Kevin Brown kevin@sysexperts.com
Henrik Steffen wrote: > In my opinion, postgresql does something on my machine > that leads to these complete system lockups. PostgreSQL might beat on the disk subsystem hard enough to show faults in it, or perhaps it uses enough CPU that the CPU isn't being cooled properly anymore, etc. But regardless, that only means that PostgreSQL is a trigger, not an actual root cause. And it means that you will almost certainly have problems even after switching database engines. You mentioned that you're using a hardware RAID controller. There is always the possibility that the driver for that controller isn't entirely stable. If you have an identical box you can drop in place, I highly recommend that you do so. I'm betting that your problems will disappear after you do that. -- Kevin Brown kevin@sysexperts.com
I concur with this, I had *exactly* this problem. My hardware vendor overclocked my intel CPU, which was fine when it was an NT box, because NT thrashes on the disk. But when running postgres on Linux on that machine (we had to put more hardware behind NT), the hardware test utilities all showed good hardware, yet there were random bit errors that went away when I removed the overclocking. NT never encountered them because it was choking on disk I/O, not on CPU cycles.

Terry Fielder
Manager Software Development and Deployment
Great Gulf Homes / Ashton Woods Homes
terry@greatgulfhomes.com

> -----Original Message-----
> From: pgsql-general-owner@postgresql.org
> [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Kevin Brown
> Sent: Monday, December 16, 2002 6:15 PM
> To: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] Total crash of my db-server
>
> [quoted message trimmed; see Kevin Brown's post above]
This recent thread about a server crashing got me to thinking of server acceptance testing.

When you are faced with the daunting task of testing a server, you should be trying to break it. Honestly, the most common mistake I see is folks ordering a new server and simply assuming there are no problems with it. Assume all hardware is bad until you've proven to yourself otherwise. Know at what point your hardware will be brought to its knees (or worse) before your users can do that to you.

Here are a few good tests for bad hardware that I've found; if anyone else has any, please chip in. Note that not all failures are deterministic and repeatable. Some show up very seldom, or only when the server room is above 70 degrees. It's easy to know when you've got a big problem with your hardware, but often hard to see the little ones.

The first thing I test with is compiling the linux kernel AND / OR compiling PostgreSQL. Both are complex projects that stress the system fairly well. Toss in a '-j 8' setting and watch the machine chew up memory and CPU time. It's easy to write a script that basically does a make clean; make over several iterations and stores the md5sum of the make output. They should all be the same. Set the box up to compile the linux kernel 1000 times over the weekend, then check the md5s and see if any differ. I've seen boxes with bad memory compile the linux kernel 10 or 20 times before generating an error. Most of the time a bad memory module is obvious; sometimes not.

memtest86 is pretty good. It too can miss a bad memory location if the memory is right MOST of the time and only sometimes flakes out on you, so you may need to run it multiple times.

Copy HUGE files across your drive arrays, and md5sum them at the beginning and end. The md5sums should always match; if they fail to match even just once out of hundreds of copies, your machine has a problem.
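[Editor's note] Scott's iterate-and-checksum idea can be wrapped in a small harness. This sketch runs any command N times, checksums its output each time, and reports how many distinct checksums were seen; `make clean; make` is the intended workload, but any deterministic command works:

```shell
# repeat_check N CMD...: run CMD N times, md5sum its stdout each run,
# and report the number of distinct checksums. On healthy hardware a
# deterministic command must produce exactly one.
repeat_check() {
    n=$1; shift
    tmp=$(mktemp)
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" 2>/dev/null | md5sum | awk '{print $1}' >> "$tmp"
        i=$((i + 1))
    done
    distinct=$(sort -u "$tmp" | wc -l | tr -d ' ')
    rm -f "$tmp"
    echo "distinct checksums: $distinct"
}

# Intended use over a weekend, in a build tree (path is an assumption):
#   repeat_check 1000 sh -c 'make clean >/dev/null; make 2>&1'
repeat_check 5 echo "hello"     # -> distinct checksums: 1
```

Anything other than "distinct checksums: 1" after a weekend of kernel builds means the box corrupted data at least once, which is exactly the subtle failure Scott describes.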
Make sure your machine can operate reliably at the temperatures it may have to experience. I've seen plenty of servers that run fine in a nice cold room (say 60 degrees F or less) but failed when the temp rose 5 or 10 degrees. A server that fails at 72 degrees F consistently is too heat sensitive to be reliable over the long haul. Remember that dust collection and age make electronics more susceptible to heat failure, so a new server that fails at 72 might fail at 70 next year, and 68 the year after that.

I know I'm missing lots, so feel free to join in. The two most important concepts for server acceptance testing:

1: Assume it is broken.
2: Try to prove it is broken.

That way, when it DOES work, you'll be pleasantly surprised, which is way better than assuming it works and finding out during production that your new server has issues.

An aside: many newer users get upset when they get told they must have bad hardware, because PostgreSQL just doesn't act like that. But it's true, PostgreSQL doesn't just act flakey. This reminds me of my favorite saying: "When you hear hoofbeats, don't think Zebra!" Loosely translated: when your postgresql box starts acting up, don't think it's postgresql's "fault", because it almost never is.
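[Editor's note] A cheap way to catch the heat-failure pattern Scott describes is to log sensor readings to disk alongside the stress run, so the last reading before a lockup survives the reset. A sketch; the `sensors` command (lm-sensors) is an assumption, so substitute healthd or whatever your platform provides:

```shell
# log_probe CMD INTERVAL COUNT: run CMD every INTERVAL seconds, COUNT
# times, appending timestamped output to a log file. sync after each
# write so the latest reading survives a hard lockup.
LOG=${LOG:-/var/tmp/burnin-sensors.log}

log_probe() {
    cmd=$1; interval=$2; count=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '%s ' "$(date '+%Y-%m-%d %H:%M:%S')" >> "$LOG"
        $cmd >> "$LOG" 2>&1
        sync                        # force the log entry out to disk
        sleep "$interval"
        i=$((i + 1))
    done
}

# Intended use while cpuburn or a 'make -j 8' kernel build is running
# ('sensors' is an assumed probe command from lm-sensors):
#   log_probe sensors 30 10000 &
```

After a crash, the tail of the log shows whether temperatures or voltages were drifting in the minutes before the box died.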
I also believe when buying servers, spend the extra money and buy quality servers. Our new Compaq DL380 G2 has redundant everything: memory, CPUs, BIOS, fans, controllers, drives, NICs. It costs a little (lot) extra, but for me it's ALWAYS paid off in the long run.

What kind of server is this that keeps crashing? Did I read this thread right earlier, that this system has RAID 1 "IDE" drives? Must be a new direction in server class machines?

Just the other night I wrote a bad sql statement that was interesting in that it would blow up postgres! It would chew CPU @ 100%, then slowly chew up all available memory, then move on to chew up all available swap space, and finally you would end up with a "killed" process. Hey, what can I say, I had to run it several more times just to see how postgres, linux and the hardware handled the whole thing, but it never locked up the hardware. Had a couple of processes left over that I had to kill by doing a pg_ctl fast restart, but that was it.

> The two most important concepts for server acceptance testing:
>
> 1: Assume it is broken.
> 2: Try to prove it is broken.
>
> That way, when it DOES work, you'll be pleasantly surprised, which is way
> better than assuming it works and finding out during production that your
> new server has issues.
I used the cpuburn program too ( http://users.ev1.net/~redelm/ ). It REALLY heats up the processor - interesting to watch the +5 volt line immediately drop significantly (using healthd -d). The other voltages also change accordingly.

Unlike a kernel recompile, the test doesn't touch files, so if you find you have a flaky system there's a lower chance of a corrupted filesystem.

Plus, compiling doesn't put as much load on my CPU - the +5V doesn't drop as much. I suspect there's not as much FPU access while compiling, and the FPU units are significant power consumers.

Not sure what to use for testing P4s though (there isn't a cpuburn test specifically for P4s). I'm using an Athlon XP, so I use the burnK7 program.

Good luck,
Link.

At 05:12 PM 12/16/02 -0700, scott.marlowe wrote:

> [acceptance-testing post trimmed; quoted in full above]
On Tue, 17 Dec 2002, Lincoln Yeoh wrote:

> I used the cpuburn program too ( http://users.ev1.net/~redelm/ ). It REALLY
> heats up the processor - interesting to watch the +5 volt immediately drop
> significantly (using healthd -d). The other voltages also change accordingly.
>
> The test doesn't touch files unlike a kernel recompile, so if you find you
> have a flaky system there's a lower chance of a corrupted filesystem.
>
> Plus compiling doesn't put as much load on my CPU - the +5V doesn't drop as
> much. I suspect there's not as much FPU access whilst compiling. And the
> FPU units are significant power consumers.
>
> Not sure what to use for testing P4s tho (there isn't a cpuburn test
> specifically for P4s). I'm using an Athlon XP so I use the burnK7 program.

I've used Quake II as a good CPU cooker as well; any good FPS (First Person Shooter) usually cranks up the heat on the CPU. Plus it is fun to leave Quake on your new dual AMD 2400 MP system in the server room in demo mode for a week or so to burn it in.

Of course, what we're all saying is how to beat your server like a mule BEFORE it goes into production. :-)
On 17 Dec 2002 at 8:40, scott.marlowe wrote:

> Of course, what we're all saying is how to beat your server like a mule
> BEFORE it goes into production. :-)

And before its warranty ends. Rather, it is a learning exercise to play with a CPU whose warranty is about to end.

Bye
Shridhar

--
scribline, n.: The blank area on the back of credit cards where one's signature goes. -- "Sniglets", Rich Hall & Friends