Thread: Database server restarting
Hello Everybody,
We are using PostgreSQL 7.2.2. Our system runs 24 hours a day, with a preventive reboot once a day. Sometimes I get this error, and after it the system hangs. Can anybody help with this?
DEBUG: pq_recvbuf: unexpected EOF on client connection
DEBUG: pq_recvbuf: unexpected EOF on client connection
DEBUG: pq_recvbuf: unexpected EOF on client connection
DEBUG: pq_recvbuf: unexpected EOF on client connection
DEBUG: database system was interrupted at 2003-05-03 04:17:19 SGT
DEBUG: checkpoint record is at 3/85EA18B0
DEBUG: redo record is at 3/85EA18B0; undo record is at 0/0; shutdown FALSE
DEBUG: next transaction id: 4111285; next oid: 7557242
DEBUG: database system was not properly shut down; automatic recovery in progress
DEBUG: ReadRecord: record with zero length at 3/85EA18F0
DEBUG: redo is not required
DEBUG: recycled transaction log file 0000000300000083
DEBUG: recycled transaction log file 0000000300000084
DEBUG: database system is ready
DEBUG: pq_recvbuf: unexpected EOF on client connection
Regards
Shoaib
On Mon, 5 May 2003, shoaib wrote:
> Hello Everybody,
>
> We are using postgressql 7.2.2 . our system running is 24 hours day it a
> preventive reboot once a day.

Odd concept. What is this reboot preventing?

> some time I am getting this error and after
> it the sytem hang .Can any body help in this.
>
> DEBUG: pq_recvbuf: unexpected EOF on client connection
> DEBUG: pq_recvbuf: unexpected EOF on client connection
> DEBUG: pq_recvbuf: unexpected EOF on client connection
> DEBUG: pq_recvbuf: unexpected EOF on client connection
> DEBUG: database system was interrupted at 2003-05-03 04:17:19 SGT
> DEBUG: checkpoint record is at 3/85EA18B0
> DEBUG: redo record is at 3/85EA18B0; undo record is at 0/0; shutdown
> FALSE
> DEBUG: next transaction id: 4111285; next oid: 7557242
> DEBUG: database system was not properly shut down; automatic recovery
> in progress

It looks like your preventative daily reboot is not preventing the problems it is causing. It is possible that the postmaster is not being shut down properly because, for example, there is a client still connected and the shutdown script isn't forcing a fast shutdown. See the pg_ctl manpage for information on the switches.

As for worrying about the messages, there's no real error message in there, aside from the 'EOF on client connection', just the normal messages on start up from a bad shutdown. If you're worried, I would look at solving whatever the answer to the daily reboot question shows is the problem.

> DEBUG: ReadRecord: record with zero length at 3/85EA18F0
> DEBUG: redo is not required
> DEBUG: recycled transaction log file 0000000300000083
> DEBUG: recycled transaction log file 0000000300000084
> DEBUG: database system is ready
> DEBUG: pq_recvbuf: unexpected EOF on client connection
>
> Regards
>
> Shoaib

-- Nigel J. Andrews
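The fast-shutdown suggestion above can be sketched as a small helper that a reboot script might call before the machine goes down. This is only a minimal sketch, not the poster's actual script: the function names and the data directory path are hypothetical. `pg_ctl stop -m fast` is the documented fast mode, which rolls back active transactions and disconnects clients instead of waiting for them.

```python
import subprocess

VALID_MODES = ("smart", "fast", "immediate")

def build_stop_command(data_dir, mode="fast"):
    """Return the argv for stopping the postmaster with the given shutdown mode."""
    if mode not in VALID_MODES:
        raise ValueError("unknown shutdown mode: %s" % mode)
    # -D selects the data directory, -m selects the shutdown mode.
    return ["pg_ctl", "stop", "-D", data_dir, "-m", mode]

def stop_postmaster(data_dir, mode="fast"):
    """Run pg_ctl and return its exit status (0 means the server stopped)."""
    return subprocess.call(build_stop_command(data_dir, mode))

# /var/lib/pgsql/data is an assumed location; use the real data directory.
print(" ".join(build_stop_command("/var/lib/pgsql/data")))
```

A reboot script would call `stop_postmaster(...)` as the postgres user and only proceed to `shutdown -r now` once it returns 0, so the next start should not need automatic recovery.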
Thanks a lot for your prompt reply.
We reboot the server to clean up the system's buffers. Before rebooting I will shut down the database server. Can you provide any further clue as to why it suddenly restarted at 4:17 AM? Our preventive maintenance runs at 1 AM.
Another process, which reads data from some flat files and updates the database with it, ended at 4:13 AM on the same day.
Your help is really appreciated.

Regards
Shoaib

-----Original Message-----
From: Nigel J. Andrews [mailto:nandrews@investsystems.co.uk]
Sent: Monday, May 05, 2003 7:08 PM
To: shoaib
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Database server restarting
On Mon, 5 May 2003, shoaib wrote:
> Thanks a lot for your prompt reply.
> We are rebooting the server for cleaning up the buffers of the
> system.Before rebooting I will shutdown database server.Can you provide
> any futher clue why suddenly at 4.17 aM it restarted.Our preventive
> maintenance run at 1 AM.
> And another process of Reading data from some flat files and updating it
> to database ended at 4.13 AM on the same day.

Hmmm... I assumed the 4:17 was from the scheduled reboot. It's a more difficult issue if that was from the postmaster exiting by itself. Did the data loading process end normally? It's a good few minutes, but in the scheme of things 4 minutes for the postmaster to be restarted automatically maybe isn't such a long time.

I'm still drawn to this daily reboot process though. You do it to clean up the system buffers. Why? Is there perhaps some instability in the system if the system uses lots of memory? What is the hardware/OS? Have you run hardware diagnostics? If it's Intel/PC-like, there is a program called memtest86 which is good at checking the memory. Be warned though: if you need that 24-hour uptime, running memtest86 properly is going to cost you a good few hours.

-- Nigel J. Andrews
Modern OSes shouldn't need rebooting unless something else is wrong. What's the quality of your hardware? Any applications compiled on bad hardware? Sigh, is it a Windows environment?
Our server reboots at 1 AM, and the job I mentioned starts at 4 AM and ended at 4:13 AM. This process is database-intensive: around 10000 records are updated or inserted. Can it be the cause of this problem? After this happened my server just hangs.

Last night I faced the same problem again on another server, and it was after yet another database-intensive process.
The system has 1 GB RAM, a 1 GHz processor, and RAID 1, and runs Red Hat Linux 7.3.
We are about to install 70 such servers.
Please help.

regards
Shoaib

-----Original Message-----
From: Dennis Gearon [mailto:gearond@cvc.net]
Sent: Monday, May 05, 2003 11:13 PM
To: shoaib
Cc: 'Nigel J. Andrews'; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Database server restarting
On Tue, May 06, 2003 at 01:41:37PM +0800, shoaib wrote:
> Our server reboots at 1 aM in the morning and the job I mentioned starts
> at 4 aM in the morning and the job ended at 4.13 AM. This process is
> database extensive around 10000 records are updated / inserted.Can it be
> the cause of this problem. After this thing happened my server just
> hangs.

When you say hang, do you mean the entire server stops responding, i.e. you can't log in any more, no web requests, etc.? If so, it's got nothing to do with postgres, as a user program simply can't hang the machine like that (unless you run out of memory, in which case it's just really slow rather than hung).

> Last night I faced the same problem again on another server and it was
> after yet another DB extensive process.
> The system has 1 GB RAM, 1 GHZ processor and RAID 1 installed on it and
> Red Hat linux 7.3.
> We are about to install 70 such servers.

When you say reboot, are you doing the proper shutdown sequence (shutdown -r now) or are you just pulling the plug? Please explain what "hangs" means.

Also, rebooting every day seems to be a massive waste of time. UNIX machines don't need that kind of maintenance.

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> "the West won the world not by the superiority of its ideas or values or
> religion but rather by its superiority in applying organized violence.
> Westerners often forget this fact, non-Westerners never do."
> - Samuel P. Huntington
When I say it hangs, I mean I am not even able to log in at the server console.
No ssh, no login from remote machines.

Regards
Shoaib

-----Original Message-----
From: Martijn van Oosterhout [mailto:kleptog@svana.org]
Sent: Tuesday, May 06, 2003 2:15 PM
To: shoaib
Cc: gearond@cvc.net; 'Nigel J. Andrews'; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Database server restarting
On Tuesday 06 May 2003 11:11, shoaib wrote:
> Our server reboots at 1 aM in the morning and the job I mentioned starts
> at 4 aM in the morning and the job ended at 4.13 AM. This process is
> database extensive around 10000 records are updated / inserted.Can it be
> the cause of this problem. After this thing happened my server just
> hangs.
>
> Last night I faced the same problem again on another server and it was
> after yet another DB extensive process.
> The system has 1 GB RAM, 1 GHZ processor and RAID 1 installed on it and
> Red Hat linux 7.3.
> We are about to install 70 such servers.

I am sure something is not quite right here. You should not need a server restart. I would like to see your database configuration options, the patterns in data access, and the min/max/avg load on each server.

10K records isn't much. Certainly not for that kind of hardware.

I am still bothered by the fact that you reboot your server daily. I can't find a good reason for it in the above description.

Shridhar
On Tue, May 06, 2003 at 02:28:57PM +0800, shoaib wrote:
> When I say hangs it means ..I am not even able to login at the server
> console also.
> No ssh, no login form remote machines.

Well, that's not postgresql's fault. It can't hang a machine like that. You should look elsewhere for the exact cause. I'm assuming here that consoles that are still logged in don't respond either? Maybe leave a top running to capture the list of processes just before it dies? Any cron jobs around the time it dies?

What other processes run at about that time?

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
There are some cron jobs running at the same time.
One server does SSH into our application server, and one cron job reads the DB and writes some data into flat files. But by the time this problem happens, these jobs are not writing any data. Last night when the server went down, the other server was trying to SSH in, some cron job was probably running, and a heavy DB process was running. I cannot run top because I cannot log in to the server, even from the console.

Regards
Shoaib

-----Original Message-----
From: Martijn van Oosterhout [mailto:kleptog@svana.org]
Sent: Tuesday, May 06, 2003 2:40 PM
To: shoaib
Cc: gearond@cvc.net; 'Nigel J. Andrews'; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Database server restarting
On Tuesday 06 May 2003 12:16, shoaib wrote:
> There are some cron jobs running at the same time...
> One server does SSH into our application server and on cron job is
> reading the DB and writing some data into flat files. But by the time
> this problem is happening these jobs are not writing any data. Last
> night when the server went down the other server wa trying to do SsH and
> probably it was running some cron job and a heavy DB process was
> running.I can not do a top bcoz I can not login into server even from
> console.

How long did you wait? If the server is doing heavy disk processing, a login could take up to 10 minutes under the worst conditions. Just don't give up after a minute or so.

Shridhar
On Tue, 6 May 2003, shoaib wrote:
> There are some cron jobs running at the same time...
> One server does SSH into our application server and on cron job is
> reading the DB and writing some data into flat files. But by the time
> this problem is happening these jobs are not writing any data. Last
> night when the server went down the other server wa trying to do SsH and
> probably it was running some cron job and a heavy DB process was
> running.I can not do a top bcoz I can not login into server even from
> console.

Do you mean you have no login privileges on the machine, or that you are only trying to log in once you see a problem? If the former, then I can't see any way you can make progress with this. If the latter, forget that; it's not helping, since by then you are unable to get any processes running.

What you should do is log in _now_, run 'top' and leave it running. It may be that when the problem occurs the session running top will stop, and so show the information from that time. However, it may also be that it doesn't stop, and when you come into the office n hours later you find it merrily ticking away showing you the current information. Therefore, investigate ways to log the information if you aren't sitting there when the problem occurs. Also take a look at procinfo; it may be helpful as well.

One thing that might be a problem is the number of open file descriptors; you could be running into the system limit on those. That sort of thing can sometimes make a system unstable.

I'd still be interested to know whether the hardware has been tested properly. Are there any known problems with RH 7.3's kernel and your particular hardware, such as the RAID device?

One interesting thing you say, though: the same thing happens on a second server. That to me suggests either a kernel/hardware problem, such as the RAID, or a bug in your own software. Perhaps an endless loop? Perhaps endless attempts to obtain a file descriptor? A process with heavy CPU usage shouldn't bring the machine down, but it can make it look very unresponsive.

-- Nigel J. Andrews
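The advice above to log system state while nobody is watching could be sketched roughly as follows. This is an illustrative sketch under assumptions, not a tool named in the thread: the function names and the log path are hypothetical, and it assumes a Linux `/proc` that exposes `/proc/loadavg` and `/proc/sys/fs/file-nr` (allocated, unused and maximum file handles). Each line is flushed and synced so the last snapshots before a hang are more likely to reach the disk.

```python
import os
import time

def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    return tuple(float(field) for field in text.split()[:3])

def parse_file_nr(text):
    """Return (allocated, unused, maximum) file handles from /proc/sys/fs/file-nr text."""
    allocated, unused, maximum = (int(field) for field in text.split()[:3])
    return allocated, unused, maximum

def read_text(path):
    with open(path) as handle:
        return handle.read()

def snapshot_line(loadavg_text, file_nr_text, now):
    """Format one log line from raw /proc contents and a Unix timestamp."""
    load1, load5, load15 = parse_loadavg(loadavg_text)
    allocated, _unused, maximum = parse_file_nr(file_nr_text)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return "%s load=%.2f/%.2f/%.2f files=%d/%d" % (
        stamp, load1, load5, load15, allocated, maximum)

def monitor(log_path, interval_seconds=10, iterations=None):
    """Append a snapshot every interval; iterations=None means run until killed."""
    count = 0
    with open(log_path, "a") as log:
        while iterations is None or count < iterations:
            line = snapshot_line(read_text("/proc/loadavg"),
                                 read_text("/proc/sys/fs/file-nr"),
                                 time.time())
            log.write(line + "\n")
            log.flush()
            os.fsync(log.fileno())  # push the line to disk before the next sleep
            count += 1
            time.sleep(interval_seconds)
```

Started well before the 4 AM job, for example with `monitor("/var/log/hang-watch.log")` (an assumed path), the tail of the log would show whether load or file-handle usage was climbing just before the machine stopped responding.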
When I log in at a console, I can see the prompt, but after typing in the login name the system just doesn't respond; it does not get to the password prompt.

Regards
Shoaib

-----Original Message-----
From: Nigel J. Andrews [mailto:nandrews@investsystems.co.uk]
Sent: Tuesday, May 06, 2003 3:44 PM
To: shoaib
Cc: 'Martijn van Oosterhout'; gearond@cvc.net; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Database server restarting
On Tue, 6 May 2003, shoaib wrote:
> When I login a console, I can see the prompt but after typing in login
> name system just don't respond it does not come to password prompt.

You may have to wait a long time, which isn't very good because a) by the time the system has enough resources to proceed with your login it's not in the same state it was in at the problem time (obviously), and b) the login process may well time out the login attempt before it even gets to the stage of asking for the password.

You really do need to be logged in before the problem occurs. Indeed, have more than one session running: run system monitoring utilities like top and procinfo, and also keep one session you can type into without stopping those utilities.

If you can get the system to again you may also find it useful to run your cron jobs by hand to verify them individually, and then to try to replicate the early morning conditions at whatever time you can test things. If you have to wait overnight every time just to take a look at a new piece of the puzzle, you're locked into that timetable for generating and testing a solution.

-- Nigel Andrews
Hello,

Thanks for your kind help.

But is there any particular reason for the database to behave like this?
DEBUG: pq_recvbuf: unexpected EOF on client connection
DEBUG: pq_recvbuf: unexpected EOF on client connection
DEBUG: pq_recvbuf: unexpected EOF on client connection
DEBUG: pq_recvbuf: unexpected EOF on client connection
DEBUG: database system was interrupted at 2003-05-03 04:17:19 SGT
DEBUG: checkpoint record is at 3/85EA18B0
DEBUG: redo record is at 3/85EA18B0; undo record is at 0/0; shutdown FALSE
DEBUG: next transaction id: 4111285; next oid: 7557242
DEBUG: database system was not properly shut down; automatic recovery in progress
DEBUG: ReadRecord: record with zero length at 3/85EA18F0
DEBUG: redo is not required
DEBUG: recycled transaction log file 0000000300000083
DEBUG: recycled transaction log file 0000000300000084
DEBUG: database system is ready
DEBUG: pq_recvbuf: unexpected EOF on client connection

Is there any particular reason for this?

Regards
Shoaib

-----Original Message-----
From: Nigel J. Andrews [mailto:nandrews@investsystems.co.uk]
Sent: Tuesday, May 06, 2003 4:55 PM
To: shoaib
Cc: 'Martijn van Oosterhout'; gearond@cvc.net; pgsql-general@postgresql.org
Subject: RE: [GENERAL] Database server restarting
On Tue, 6 May 2003, shoaib wrote:

> Hello,
>
> Thanks for you kind help.
>
> But is there any particular reason for database to do such kind of
> behavior.
> DEBUG: pq_recvbuf: unexpected EOF on client connection
> DEBUG: pq_recvbuf: unexpected EOF on client connection
> DEBUG: pq_recvbuf: unexpected EOF on client connection
> DEBUG: pq_recvbuf: unexpected EOF on client connection
> DEBUG: database system was interrupted at 2003-05-03 04:17:19 SGT
> DEBUG: checkpoint record is at 3/85EA18B0
> DEBUG: redo record is at 3/85EA18B0; undo record is at 0/0; shutdown FALSE
> DEBUG: next transaction id: 4111285; next oid: 7557242
> DEBUG: database system was not properly shut down; automatic recovery in progress
> DEBUG: ReadRecord: record with zero length at 3/85EA18F0
> DEBUG: redo is not required
> DEBUG: recycled transaction log file 0000000300000083
> DEBUG: recycled transaction log file 0000000300000084
> DEBUG: database system is ready
> DEBUG: pq_recvbuf: unexpected EOF on client connection
>
> Is there any particular reason for this thing.

Well, there are probably lots of potential causes, but consider something like this:

  process A starts up
  process A uses N MB of memory
  process A loops
  process A uses N+1 MB of memory
  ...
  process B starts up and connects to DB
  memory available is 1MB
  process A loops
  process A uses N+1 MB of memory
  process B wants 10KB more memory
  process B dies for want of memory allocation checks
  DB notes the unexpected EOF on the connection from B
  process A loops
  process A wants N+1 MB of memory
  process A retries N+1 MB of memory
  process A retries N+1 MB of memory
  process A retries N+1 MB of memory
  process A retries N+1 MB of memory
  ...
  system can't start any other process for lack of memory resources

You've got high system load, inability for processes to claim more memory and errors about programs exiting at unexpected times.

--
Nigel Andrews
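A scenario like the one above could be confirmed or ruled out by logging free memory and swap in the hours before the failure. The following is only a minimal sketch, assuming a Linux /proc/meminfo that carries MemFree: and SwapFree: lines; the function name mem_free_kb and its optional file argument are invented for illustration.

```shell
#!/bin/sh
# Print "MemFree SwapFree" in kB, read from a meminfo-style file.
# The optional first argument (default /proc/meminfo) exists only so the
# function can be tried against a saved copy of the file.
mem_free_kb() {
    awk '/^MemFree:/  { mem = $2 }
         /^SwapFree:/ { swap = $2 }
         END { print mem + 0, swap + 0 }' "${1:-/proc/meminfo}"
}
```

Calling it once a minute from a session that is already logged in, for example with echo "$(date) $(mem_free_kb)" >> mem.log followed by sleep 60 in a loop, would leave a record showing whether both numbers head towards zero before the machine stops responding.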
On Tuesday 06 May 2003 14:25, Nigel J. Andrews wrote:
> On Tue, 6 May 2003, shoaib wrote:
> > When I login a console, I can see the prompt but after typing in login
> > name system just don't respond it does not come to password prompt.
>
> You may have to wait a long time which isn't very good because a) by the
> time the system has enough resources to proceed with your log in it's not
> in the same state it was in at the problem time (obviously) and b) the
> login process may well timeout the login attempt before it even gets to the
> stage of asking for the password.

I have two suggestions for the OP, if he is interested in experimenting with alternatives, assuming the problem is with a heavy DB process.

1) Try FreeBSD 4.8 and PostgreSQL from ports. I have a gut feeling that BSD would be more responsive under heavy disk load than Linux. No concrete evidence, just a gut feeling.

2) Try a recent kernel. I suggest you get 2.4.20 from kernel.org and apply patches from http://members.optusnet.com.au/ckolivas/kernel/. Just get the base patch that includes O(1), pre-empt and low-latency. That should be good enough.

Basically, with either of these, the unresponsiveness that you are facing should be gone and you should be able to debug the problem.

HTH

Shridhar

"Nigel J. Andrews" <nandrews@investsystems.co.uk> writes:
>> But is there any particular reason for database to do such kind of
>> behavior.
>> DEBUG: pq_recvbuf: unexpected EOF on client connection
>> DEBUG: pq_recvbuf: unexpected EOF on client connection
>> DEBUG: pq_recvbuf: unexpected EOF on client connection
>> DEBUG: pq_recvbuf: unexpected EOF on client connection
>> DEBUG: database system was interrupted at 2003-05-03 04:17:19 SGT
>> DEBUG: checkpoint record is at 3/85EA18B0

> You've got high system load, inability for processes to claim more memory and
> errors about programs exiting at unexpected times.

What strikes me about the above trace is that we see "database system was interrupted" without any prior failure. That says to me that something killed the postmaster itself --- if a database child process died, the postmaster would have logged the fact.

That leaves me with two questions: what killed the postmaster, and what restarted it?

If Nigel's guess is right that the system is under heavy memory pressure, and this is a Linux box, then the kernel itself might have kill -9'd the postmaster to try to get out of a memory shortage. I can't think of very many other theories (though I do recall at least one self-inflicted problem, from someone whose "maintenance script" kill -9'd the postmaster for random reasons...)

I'd also like to know whether the system is configured to auto-restart the postmaster, and if so how, and does it do any mucking about (like removing lockfiles) while it's doing so?

regards, tom lane
Look in the archives about disk and memory testing: memtest86 and some other program.

shoaib wrote:
> Our server reboots at 1 AM, and the job I mentioned starts at 4 AM
> and ended at 4:13 AM. This process is database intensive: around
> 10000 records are updated / inserted. Can it be the cause of this
> problem? After this thing happened my server just hangs.
>
> Last night I faced the same problem again on another server and it was
> after yet another DB intensive process.
> The system has 1 GB RAM, 1 GHz processor and RAID 1 installed on it and
> Red Hat Linux 7.3.
> We are about to install 70 such servers.
>
> Please help.
>
> regards
> Shoaib
>
> -----Original Message-----
> From: Dennis Gearon [mailto:gearond@cvc.net]
> Sent: Monday, May 05, 2003 11:13 PM
> To: shoaib
> Cc: 'Nigel J. Andrews'; pgsql-general@postgresql.org
> Subject: Re: [GENERAL] Database server restarting
>
> Modern OS's shouldn't need rebooting, unless something else is wrong.
> What's the quality of your hardware? Any applications compiled on bad
> hardware?
>
> sigh, is it a windows environment?
>
> shoaib wrote:
>
>>Thanks a lot for your prompt reply.
>>We are rebooting the server for cleaning up the buffers of the
>>system. Before rebooting I will shutdown database server. Can you provide
>>any further clue why suddenly at 4.17 AM it restarted. Our preventive
>>maintenance run at 1 AM.
>>And another process of reading data from some flat files and updating it
>>to database ended at 4.13 AM on the same day.
>>Your help is really appreciated.
>>
>>Regards
>>Shoaib
>>
>>-----Original Message-----
>>From: Nigel J. Andrews [mailto:nandrews@investsystems.co.uk]
>>Sent: Monday, May 05, 2003 7:08 PM
>>To: shoaib
>>Cc: pgsql-general@postgresql.org
>>Subject: Re: [GENERAL] Database server restarting
>>
>>On Mon, 5 May 2003, shoaib wrote:
>>
>>>Hello Everybody,
>>>
>>>We are using postgressql 7.2.2 . our system running is 24 hours day it a
>>>preventive reboot once a day.
>>
>>Odd concept. What is this reboot preventing?
>>
>>>some time I am getting this error and after
>>>it the sytem hang .Can any body help in this.
>>>
>>>DEBUG: pq_recvbuf: unexpected EOF on client connection
>>>DEBUG: pq_recvbuf: unexpected EOF on client connection
>>>DEBUG: pq_recvbuf: unexpected EOF on client connection
>>>DEBUG: pq_recvbuf: unexpected EOF on client connection
>>>DEBUG: database system was interrupted at 2003-05-03 04:17:19 SGT
>>>DEBUG: checkpoint record is at 3/85EA18B0
>>>DEBUG: redo record is at 3/85EA18B0; undo record is at 0/0; shutdown FALSE
>>>DEBUG: next transaction id: 4111285; next oid: 7557242
>>>DEBUG: database system was not properly shut down; automatic recovery
>>>in progress
>>
>>It looks like your preventative daily reboot is not preventing the
>>problems it is causing. It is possible that the postmaster is not being
>>shutdown properly because, for example, there is a client still connected
>>and the shutdown script isn't forcing a fast shutdown. See pg_ctl manpage
>>for infomation on the switches.
>>
>>As for worrying about the messages, there's no real error message in
>>there, aside from the 'EOF on client connection', just the normal messages
>>on start up from a bad shutdown. If you're worried, I would look at solving
>>whatever the answer to the daily reboot question shows is the problem.
>>
>>>DEBUG: ReadRecord: record with zero length at 3/85EA18F0
>>>DEBUG: redo is not required
>>>DEBUG: recycled transaction log file 0000000300000083
>>>DEBUG: recycled transaction log file 0000000300000084
>>>DEBUG: database system is ready
>>>DEBUG: pq_recvbuf: unexpected EOF on client connection
>>>
>>>Regards
>>>
>>>Shoaib
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
Here's a little script that will run top every so often and log the output to a file you can read later when the machine's recovered.

#!/bin/bash
# Append one batch-mode snapshot of top to log.txt every 60 seconds.
while true; do
    top -bn 1 >>log.txt
    sleep 60
done

Just run it in your home directory. Make sure your /home partition has enough space. Under heavy load, every 60 seconds you'll be adding about 2k to 5k to that file. Change the sleep 60 to something smaller if you want it to run more often. No warranties implied, use at your own risk. :-)

On Tue, 6 May 2003, shoaib wrote:

> There are some cron jobs running at the same time...
> One server does SSH into our application server and one cron job is
> reading the DB and writing some data into flat files. But by the time
> this problem is happening these jobs are not writing any data. Last
> night when the server went down the other server was trying to do SSH and
> probably it was running some cron job and a heavy DB process was
> running. I can not do a top because I can not login into server even from
> console.
>
> Regards
> Shoaib
>
>
> -----Original Message-----
> From: Martijn van Oosterhout [mailto:kleptog@svana.org]
> Sent: Tuesday, May 06, 2003 2:40 PM
> To: shoaib
> Cc: gearond@cvc.net; 'Nigel J. Andrews'; pgsql-general@postgresql.org
> Subject: Re: [GENERAL] Database server restarting
>
> On Tue, May 06, 2003 at 02:28:57PM +0800, shoaib wrote:
> > When I say hangs it means ..I am not even able to login at the server
> > console also.
> > No ssh, no login form remote machines.
>
> Well, that's not postgresql's fault. It can't hang a machine like that.
> You should look elsewhere for the exact cause. I'm assuming here that
> consoles that are still logged in don't respond either? Maybe leave a top
> running to capture the list of processes just before it dies? Any cronjobs
> about the time it dies?
>
> What other processes run at about that time?
On Tue, 6 May 2003, shoaib wrote:

> When I login a console, I can see the prompt but after typing in login
> name system just don't respond it does not come to password prompt.

FYI, for future reference, this is generally referred to as being non-responsive, not hanging. Hanging means the server has truly crashed and is no longer answering pings, etc. Usually hanging servers mean bad hardware. Non-responsive servers often mean that you've increased the load beyond what the server can handle, and it's busily swapping out resources left and right to try to stay up and running.

And there is NO reason to reboot a RedHat Linux 7.x box every night. Mine routinely get 100 days of uptime between reboots, sometimes 200 days. Usually by then we're either upgrading to a new version or installing a new kernel and have to reboot.

Leaving the OS up is actually a good thing, as it keeps the buffers from getting cleared out. Note that if all you want is for OS cache buffers to flush, just write a short C program that mallocs huge chunks of memory until you start swapping a bit. But that's counterproductive. PostgreSQL flushes buffers when it's writing, so you don't have to worry about data loss, and the data in those buffers takes a while to load.

 11:42am up 36 days, 1:20, 4 users, load average: 0.27, 0.28, 0.32
195 processes: 194 sleeping, 1 running, 0 zombie, 0 stopped
CPU0 states: 21.0% user, 0.0% system, 0.0% nice, 78.0% idle
CPU1 states: 1.0% user, 8.0% system, 0.0% nice, 89.0% idle
Mem: 1543980K av, 1535472K used, 8508K free, 265928K shrd, 48872K buff
Swap: 2048208K av, 164524K used, 1883684K free 871720K cached

Note the 870 Meg of cached data. It takes my server at least a day of running before it can use the extra memory as cache, and rebooting it would make it start over.

Unlike Windows machines, Unix machines tend to run faster the longer they're left up.
On Tue, 6 May 2003, scott.marlowe wrote:

> Here's a little script that will run top every so often and log the output
> to a file you can read later when the machine's recovered.
>
> #!/bin/bash
> while true; do
>     top -bn 1 >>log.txt
>     sleep 60
> done
>
> Just run it in your home directory. Make sure your /home partition has
> enough space. Under heavy load each 60 seconds you'll be adding about 2k
> to 5k to that file. Change the sleep 60 to something smaller if you want
> it to run more often. No warranties implied, use at your own risk. :-)

The problem with that is that it starts up new processes on each iteration. At the least you need to redirect stderr to the log file as well. Should top fail to launch, that would provide some help with the problem, but not as much as actually having the output of top.

It would be much better to just do a:

  top -d 60 -b -n 600 > log.txt 2>&1

which would take snapshots for 10 hours, or just set a very large number instead of 600 and interrupt it when wanted. The 60 second interval can easily be changed then as well.

Then, of course, if the issue is disk activity, swap or otherwise, there's also vmstat. What about file descriptor usage? It's possible to determine an estimate of that by looking through /proc, in which case I'd say a simple shell script would suffice, and never mind the possible failures to start programs like ls. Then what if it's interrupt activity that's the problem? Not very likely on modern hardware, but even 10Mbps ethernet could bring a system almost to its knees with interrupt activity on older stuff.

I think the important point in this is that there is something making the system unstable, and the extra load produced by the postgresql cron jobs is sufficient to make that something significant, where normally the daily reboot prevents it getting to that stage. So again, it's the question of 'why reboot daily?'

--
Nigel Andrews
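The /proc estimate of file descriptor usage mentioned above could look something like the following. It is a sketch only: count_fds is a hypothetical helper, and it relies on Linux exposing one entry per open descriptor under /proc/PID/fd, which is readable only by the process owner or root.

```shell
#!/bin/sh
# Count the open file descriptors of one process by listing /proc/PID/fd.
# The second argument (default /proc) is only there so the function can be
# pointed at a test directory. Prints 0 if the directory cannot be read.
count_fds() {
    ls "${2:-/proc}/$1/fd" 2>/dev/null | wc -l | tr -d ' '
}
```

For example, count_fds $$ reports on the current shell, and running it against the postmaster's PID from a loop would show whether descriptor use climbs during the 4 AM job. For disk and swap activity, leaving vmstat 60 > vmstat.log 2>&1 running in a logged-in session gives one summary line per minute in the same spirit as the top command above.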
I am not using any restart script (maybe I have misunderstood you), but I am starting postgres at system boot, and this is the script for that:

#! /bin/sh
#
# Startup script to run Postgresql
#
#
start() {
    # Make sure the sbin directories are on root's PATH.
    if [ `id -u` = 0 ] && ! echo $PATH | /bin/grep -q "/sbin" ; then
        PATH=/sbin:$PATH
    fi
    if [ `id -u` = 0 ] && ! echo $PATH | /bin/grep -q "/usr/sbin" ; then
        PATH=/usr/sbin:$PATH
    fi
    if [ `id -u` = 0 ] && ! echo $PATH | /bin/grep -q "/usr/local/sbin" ; then
        PATH=/usr/local/sbin:$PATH
    fi
    if ! echo $PATH | /bin/grep -q "/usr/X11R6/bin" ; then
        PATH="$PATH:/usr/X11R6/bin"
    fi
    PATH=$PATH:.:/usr/local/jdk/bin:/usr/local/pgsql/bin
    #FOR NON-RAID
    #PGDATA=/usr/local/pgsql/data
    #FOR RAID
    PGDATA=/data/pgsql/data
    export PATH PGDATA
    su -l postgres -s /bin/sh -c "/usr/local/pgsql/bin/pg_ctl start -D $PGDATA -o '-i' -s -l $PGDATA/simspgsql.log &"
    sleep 1
    if [ -f $PGDATA/postmaster.pid ]
    then
        echo "PostgreSQL started"
    else
        echo "PostgreSQL not started"
    fi
}

stop() {
    su -l postgres -s /bin/sh -c "/usr/local/pgsql/bin/pg_ctl stop -D $PGDATA -s -m fast"
    sleep 1
    if [ -f $PGDATA/postmaster.pid ]
    then
        echo "PostgreSQL not stopped"
    else
        echo "PostgreSQL is currently stopped"
    fi
}

restart() {
    stop
    start
}

status() {
    su -l postgres -s /bin/sh -c "/usr/local/pgsql/bin/pg_ctl status -D $PGDATA"
}

case "$1" in
    start)
        start
        ;;
    stop)
        stop
        ;;
    restart)
        restart
        ;;
    status)
        status
        ;;
    *)
        echo "Usage: $0 {start|stop|restart|status}"
esac

Let me know if there is any problem in it.

Regards,
Shoaib

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Tuesday, May 06, 2003 10:22 PM
To: Nigel J. Andrews
Cc: shoaib; 'Martijn van Oosterhout'; gearond@cvc.net; pgsql-general@postgresql.org
Subject: Re: [GENERAL] Database server restarting

"Nigel J. Andrews" <nandrews@investsystems.co.uk> writes:
>> But is there any particular reason for database to do such kind of
>> behavior.
>> DEBUG: pq_recvbuf: unexpected EOF on client connection
>> DEBUG: pq_recvbuf: unexpected EOF on client connection
>> DEBUG: pq_recvbuf: unexpected EOF on client connection
>> DEBUG: pq_recvbuf: unexpected EOF on client connection
>> DEBUG: database system was interrupted at 2003-05-03 04:17:19 SGT
>> DEBUG: checkpoint record is at 3/85EA18B0

> You've got high system load, inability for processes to claim more memory and
> errors about programs exiting at unexpected times.

What strikes me about the above trace is that we see "database system was interrupted" without any prior failure. That says to me that something killed the postmaster itself --- if a database child process died, the postmaster would have logged the fact.

That leaves me with two questions: what killed the postmaster, and what restarted it?

If Nigel's guess is right that the system is under heavy memory pressure, and this is a Linux box, then the kernel itself might have kill -9'd the postmaster to try to get out of a memory shortage. I can't think of very many other theories (though I do recall at least one self-inflicted problem, from someone whose "maintenance script" kill -9'd the postmaster for random reasons...)

I'd also like to know whether the system is configured to auto-restart the postmaster, and if so how, and does it do any mucking about (like removing lockfiles) while it's doing so?

regards, tom lane
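One thing that may be worth checking in the init script posted above: PGDATA is assigned only inside start(), so when the script is invoked as "stop" in a fresh process, as happens at reboot, $PGDATA expands to nothing unless it is already in the caller's environment. A minimal sketch of the effect, using hypothetical stand-in functions with echo in place of the real pg_ctl call:

```shell
#!/bin/sh
# Stand-ins for the script's start() and stop(); echo replaces pg_ctl so
# nothing is actually started or stopped.
unset PGDATA

start_demo() {
    PGDATA=/data/pgsql/data
    echo "pg_ctl start -D $PGDATA"
}

stop_demo() {
    # In the real script this string is re-parsed by the shell that su
    # starts, so an empty PGDATA leaves -D followed directly by -s.
    echo "pg_ctl stop -D $PGDATA -s -m fast"
}

# A fresh "stop" invocation never runs start_demo, so PGDATA is empty here.
stop_demo
```

If this is what happens on the real machine, pg_ctl would be handed -s as its data directory and would not stop the postmaster, while the following test for $PGDATA/postmaster.pid would look for /postmaster.pid and report "PostgreSQL is currently stopped" regardless. That could account for a "not properly shut down" message after the daily reboot, though not by itself for the 04:17 interruption. Moving the PGDATA assignment and export to the top of the script, outside start(), would be one way to rule it out.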