Thread: Server stops responding randomly for 5 minutes
Sometimes per week server stops randomly responding for approx 5 minutes. User should wait for 5 minutes before server responds. Other users can work normally at same time. Monday this happens at 12:16 I havent noticed anythis special in PostgreSql and windows logs at this time. How to fix or find reason for this ? Andrus. Config: Win 2008 server 2 GB RAM PostgreSQL 9.0.4, compiled by Visual C++ build 1500, 64-bit 10-25 users manually entering sales order, e.q. low server usage Server is installed using setup program from postgresql.org directed site, postresql.conf file is not modified after installation. Database size: 2165 MB Biggest files in decreasing order of size: 1 firma1.bilkaib 880 MB table 2 public.attachme 454 MB table 3 firma1.rid 401 MB table 4 firma1.omrid 88 MB table 5 firma1.omdok 79 MB table 6 firma1.bilkaib_cr_idx 75 MB index 7 firma1.bilkaib_cr_pattern_idx 74 MB index 8 firma1.bilkaib_db_pattern_idx 74 MB index 9 firma1.bilkaib_db_idx 74 MB index 10 firma1.klient 69 MB table 11 firma1.dok 64 MB table 12 firma1.bilkaib_kuupaev_idx 60 MB index 13 firma1.bilkaib_dokumnr_idx 53 MB index 14 firma1.bilkaib_pkey 43 MB index 15 firma1.rid_rtellimus_idx 38 MB index 16 firma1.rid_toode_idx 25 MB index 17 firma1.rid_toode_pattern_idx 24 MB index 18 public.strings 23 MB table 19 firma1.rid_inpdokumnr_idx 19 MB index 20 firma1.toode 17 MB table 21 firma1.rid_dokumnr_idx 16 MB index 22 firma1.rid_pkey 16 MB index 23 firma1.summav 15 MB table 24 public.report 13 MB table 25 public.desktop 9784 kB table 26 public.mailbox 7128 kB table 27 public.localfil 5584 kB table 28 pg_toast.pg_toast_36145_index 5392 kB index ...
I think logs may help. Have you checked that?
2011/7/13 Andrus <kobruleht2@hot.ee>
Sometimes per week server stops randomly responding for approx 5 minutes.
User should wait for 5 minutes before server responds.
Other users can work normally at same time.
Monday this happens at 12:16
I havent noticed anythis special in PostgreSql and windows logs at this time.
How to fix or find reason for this ?
Andrus.
Config:
Win 2008 server 2 GB RAM
PostgreSQL 9.0.4, compiled by Visual C++ build 1500, 64-bit
10-25 users manually entering sales order, e.q. low server usage
Server is installed using setup program from postgresql.org directed site,
postresql.conf file is not modified after installation.
Database size: 2165 MB
Biggest files in decreasing order of size:
1 firma1.bilkaib 880 MB table
2 public.attachme 454 MB table
3 firma1.rid 401 MB table
4 firma1.omrid 88 MB table
5 firma1.omdok 79 MB table
6 firma1.bilkaib_cr_idx 75 MB index
7 firma1.bilkaib_cr_pattern_idx 74 MB index
8 firma1.bilkaib_db_pattern_idx 74 MB index
9 firma1.bilkaib_db_idx 74 MB index
10 firma1.klient 69 MB table
11 firma1.dok 64 MB table
12 firma1.bilkaib_kuupaev_idx 60 MB index
13 firma1.bilkaib_dokumnr_idx 53 MB index
14 firma1.bilkaib_pkey 43 MB index
15 firma1.rid_rtellimus_idx 38 MB index
16 firma1.rid_toode_idx 25 MB index
17 firma1.rid_toode_pattern_idx 24 MB index
18 public.strings 23 MB table
19 firma1.rid_inpdokumnr_idx 19 MB index
20 firma1.toode 17 MB table
21 firma1.rid_dokumnr_idx 16 MB index
22 firma1.rid_pkey 16 MB index
23 firma1.summav 15 MB table
24 public.report 13 MB table
25 public.desktop 9784 kB table
26 public.mailbox 7128 kB table
27 public.localfil 5584 kB table
28 pg_toast.pg_toast_36145_index 5392 kB index
...
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
> Sometimes per week server stops randomly responding for approx 5 minutes. > User should wait for 5 minutes before server responds. > Other users can work normally at same time. So does the whole machine just stop responding, or just the postgresql? Are those other users using postgresql or some other services? > Monday this happens at 12:16 > I havent noticed anythis special in PostgreSql and windows logs at this > time. Have you done some basic monitoring? This typically happens when the machine does a lot of I/O (swapping, checkpoints, ...) - not sure how this is logged. > How to fix or find reason for this ? First, you have to determine what's wrong. Set up some basic monitoring, on Linux I'd use iostat/vmstat, not sure about the windows - try process explorer from sysinternals. > 10-25 users manually entering sales order, e.q. low server usage So they're entering the data directly into the database? Are you sure there's not something wrong in the application (e.g. a loop that takes a lot of time in some cases)? > Server is installed using setup program from postgresql.org directed site, > postresql.conf file is not modified after installation. Not sure what setup program you mean (there's an install from EnterpriseDB), but the default config values are usually too low (e.g. the default shared buffers is 24MB IIRC, but in your case 512MB would be probably better). Not sure if that's the problem, though. Tomas
2011/7/13 Andrus <kobruleht2@hot.ee>: > Sometimes per week server stops randomly responding for approx 5 minutes. > User should wait for 5 minutes before server responds. > Other users can work normally at same time. Sounds very much like a locking issue. Are you doing something like storing a frequently updated system-wide setting in a table with a single row? http://wiki.postgresql.org/wiki/Lock_Monitoring When the problem happens again, report back what the top query you see on that wiki page shows. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
Tomáš and Tony, thank you. > Have you done some basic monitoring? This typically happens when the > machine does a lot of I/O (swapping, checkpoints, ...) - not sure how this > is logged. This is dedicated server, used only for PostgreSql. I filtered windows event logs near this time (12:16) . The is error The OpsMgr Connector could not connect to OPSMAN.yyyyy.xxx:5723. The error code is 11004L(The requested name is valid, but no data of the requested type was found.). Please verify there is network connectivity, the server is running and has registered it's listening port, and there are no firewalls blocking traffic to the destination. and OpsMgr related warnings. I do'nt know what is OpsMgr but expect that this is not realted to this issue. postgres log at this time shows 2011-07-11 11:45:27 EEST LOG: could not receive data from client: No connection could be made because the target machine actively refused it. 2011-07-11 11:45:27 EEST LOG: unexpected EOF on client connection 2011-07-11 12:18:35 EEST LOG: could not receive data from client: No connection could be made because the target machine actively refused it. 2011-07-11 12:18:35 EEST LOG: unexpected EOF on client connection 2011-07-11 12:18:46 EEST LOG: could not receive data from client: No connection could be made because the target machine actively refused it. 2011-07-11 12:18:46 EEST LOG: unexpected EOF on client connection 2011-07-11 12:20:55 EEST LOG: could not receive data from client: No connection could be made because the target machine actively refused it. those messages appear all the time so I expect that they are also not related this issue. > First, you have to determine what's wrong. Set up some basic monitoring, > on Linux I'd use iostat/vmstat, not sure about the windows - try process > explorer from sysinternals. Users report hangup later, I cannot monitor this at hangup time. No idea what / how to monitor more. Anyway I added lines log_lock_waits = on log_temp_files = 2000 log_min_duration_statement = 10000 log_line_prefix='%t %u %d ' log_min_error_statement = warning to end of postgresql.conf and restarted server. >> 10-25 users manually entering sales order, e.q. low server usage > > So they're entering the data directly into the database? Try are using windows application from RDP (other computer in LAN) which generates insert, update, delete commands to server for every entered order. > Are you sure > there's not something wrong in the application (e.g. a loop that takes a > lot of time in some cases)? This application is used for may other sites without hangup. I havent written indentionally such loops, statements cannot take so much time. >> Server is installed using setup program from postgresql.org directed >> site, >> postresql.conf file is not modified after installation. > > Not sure what setup program you mean (there's an install from > EnterpriseDB), but the default config values are usually too low (e.g. the > default shared buffers is 24MB IIRC, but in your case 512MB would be > probably better). Not sure if that's the problem, though. PostgreSql was installed from EnterpriceDB, postgresql.org site contains link into it. I ran now EnterpiseDb Tuning wizard and optimized postgresql.conf using it for mixed app server and restarted service. postgresql.conf now contains # NOTE: This has been modified by EnterpriseDB's Tuning Wizard on 2011/07/13 14:10:42 # Original Value for "shared_buffers" was "32MB" shared_buffers = 69608 # min 128kB Andrus.
On 13 Červenec 2011, 13:34, Andrus wrote: > 2011-07-11 12:18:35 EEST LOG: unexpected EOF on client connection > 2011-07-11 12:18:46 EEST LOG: could not receive data from client: No > connection could be made because the target machine actively refused it. > > > 2011-07-11 12:18:46 EEST LOG: unexpected EOF on client connection > 2011-07-11 12:20:55 EEST LOG: could not receive data from client: No > connection could be made because the target machine actively refused it. AFAIK this happens when the connection is not closed properly (e.g. the application exits without closing the connection etc.). >> First, you have to determine what's wrong. Set up some basic monitoring, >> on Linux I'd use iostat/vmstat, not sure about the windows - try process >> explorer from sysinternals. > > Users report hangup later, I cannot monitor this at hangup time. > No idea what / how to monitor more. Anyway I added lines > > log_lock_waits = on > log_temp_files = 2000 > log_min_duration_statement = 10000 > log_line_prefix='%t %u %d ' > log_min_error_statement = warning OK, but you need to monitor the system too. See for example the Process Explorer - you can run it and then see the history when the users report there was a problem. > I ran now EnterpiseDb Tuning wizard and optimized postgresql.conf using it > for mixed app server and restarted service. > shared_buffers = 69608 # min 128kB OK, that's 540MB - pretty close to 512MB I have recommended. Tomas
Peter, >> Sometimes per week server stops randomly responding for approx 5 minutes. >> User should wait for 5 minutes before server responds. >> Other users can work normally at same time. >Sounds very much like a locking issue. Thank you. This may be the issue. > Are you doing something like >storing a frequently updated system-wide setting in a table with a >single row? >http://wiki.postgresql.org/wiki/Lock_Monitoring AFAIK Order entry should not update single row global parameter setting table. Order id and numbers, order detail row ids and row order numbers are read from PostgreSql sequence tables using nextval. Hopefully this cannot cause the issue. Specific for this site there every order entry invokes lot of tiny updates for single row foreign key fields on large rid table which are shown below. Not sure that they cause delay. >When the problem happens again, report back what the top query you see > on that wiki page shows. Issue is reported by users later, I cannot access to server at the time when issue occurs. Users can report exact time when issue occured. How to set additional logging for find reason for this after issue is reported? I optimized postgresql.conf to mized server using Tuning Wizard and added log_lock_waits = on log_temp_files = 2000 log_min_duration_statement = 10000 log_line_prefix='%t %u %d ' log_min_error_statement = warning to end of postgresql.conf How to tune logging so that reason is logged if this happens again ?? Andrus. Order entry updates: update rid SET rid7obj='.', reakuupaev=?dok.kuupaev where (toode like 'KM%' or toode like 'TT%') and dokumnr= <<TRANSFORM(dok.dokumnr)>>; update rid SET rid7obj='.' where toode like 'YM%' and dokumnr=<<TRANSFORM(dok.dokumnr)>>; update rid SET reakuupaev=?dok.kuupaev where toode like 'TT%' and dokumnr=<<TRANSFORM(dok.dokumnr)>>; update rid SET rid6obj='XX-XXX' where dokumnr=<<TRANSFORM(dok.dokumnr)>> and toode like 'TT%' ; update rid SET rid6obj='XX-XXX' where dokumnr=<<TRANSFORM(dok.dokumnr)>> and toode like 'K%' ; update rid SET rid6obj='XX-XXX' where dokumnr=<<TRANSFORM(dok.dokumnr)>> and toode like 'Y%' ; update rid SET myygikood='E' where dokumnr=<<TRANSFORM(dok.dokumnr)>> and rid2obj like 'MLE%'; update rid SET myygikood='Y' where dokumnr=<<TRANSFORM(dok.dokumnr)>> and rid2obj like 'MLE%' and rid6obj like 'EU%' ; update rid SET rid3obj='MNOEU' where dokumnr=<<TRANSFORM(dok.dokumnr)>> and rid2obj like 'MLE%'; update rid SET rid3obj='MEURO' where dokumnr=<<TRANSFORM(dok.dokumnr)>> and rid2obj like 'MLE%' and rid6obj like 'EU%' ; update rid SET myygikood='E' where dokumnr=<<TRANSFORM(dok.dokumnr)>> and rid2obj like 'MMA%'; update rid SET rid3obj='MNOEU' where dokumnr=<<TRANSFORM(dok.dokumnr)>> and rid2obj like 'MMA%'; update rid SET myygikood='Y' where dokumnr=<<TRANSFORM(dok.dokumnr)>> and rid2obj like 'MMA%' and rid6obj like 'EU%'; update rid SET rid3obj='MEURO' where dokumnr=<<TRANSFORM(dok.dokumnr)>> and rid2obj like 'MMA%' and rid6obj like 'EU%'; update rid SET myygikood='Y' where dokumnr=<<TRANSFORM(dok.dokumnr)>> and rid2obj like 'MPU%'; update rid SET rid3obj='MNOEU' where dokumnr=<<TRANSFORM(dok.dokumnr)>> and rid2obj like 'MPU%'; update rid SET rid3obj='MEURO' where dokumnr=<<TRANSFORM(dok.dokumnr)>> and rid2obj like 'MPU%' and rid6obj like 'EU%';
I think your log files, and the description of your issue provide at least a decent idea of what is probably happening. Unfortunately without being able to observe the server when it happens, I think you are going to have limited success tracking this down. Proximal causes could be network errors or software bugs. On Wed, Jul 13, 2011 at 4:34 AM, Andrus <kobruleht2@hot.ee> wrote: > Tomáš and Tony, > > thank you. > >> Have you done some basic monitoring? This typically happens when the >> machine does a lot of I/O (swapping, checkpoints, ...) - not sure how this >> is logged. > > This is dedicated server, used only for PostgreSql. > > I filtered windows event logs near this time (12:16) . The is error > > The OpsMgr Connector could not connect to OPSMAN.yyyyy.xxx:5723. The error > code is 11004L(The requested name is valid, but no data of the requested > type was found.). Please verify there is network connectivity, the server > is running and has registered it's listening port, and there are no > firewalls blocking traffic to the destination. > > and OpsMgr related warnings. I do'nt know what is OpsMgr but expect that > this is not realted to this issue. It's a product called Microsoft Operations Manager or MOM for short. The error could be related even if the software is not. I wouldn't toss it out entirely yet. > > postgres log at this time shows > > 2011-07-11 11:45:27 EEST LOG: could not receive data from client: No > connection could be made because the target machine actively refused it. > > 2011-07-11 11:45:27 EEST LOG: unexpected EOF on client connection > 2011-07-11 12:18:35 EEST LOG: could not receive data from client: No > connection could be made because the target machine actively refused it. Sounds like evidence of a network problem or unexpected applicatoin termination. What I suspect is actually happening (although this is based on sparse evidence, it may be a useful starting point) is that a session which is locking certain records is getting spuriously disconnected and the session is timing out along with locks. During this timeout period, the transaction is not committed or rolled back and the locks are still there. However that's hardly enough to solve the problem. Given that you are running the application over RDP it would be interesting to find out whether these happen when the RDP session is closed. Maybe the application is killed and has locks that aren't released until it times out on the server? Of course this only causes an issue in the cases where there is a request for the same locks by another user when the RDP session closes. > > 2011-07-11 12:18:35 EEST LOG: unexpected EOF on client connection > 2011-07-11 12:18:46 EEST LOG: could not receive data from client: No > connection could be made because the target machine actively refused it. > > > 2011-07-11 12:18:46 EEST LOG: unexpected EOF on client connection > 2011-07-11 12:20:55 EEST LOG: could not receive data from client: No > connection could be made because the target machine actively refused it. > > those messages appear all the time so I expect that they are also not > related this issue. They could be though. If you have a lock conflict and a connection timeout that could be causing your problem. > >> First, you have to determine what's wrong. Set up some basic monitoring, >> on Linux I'd use iostat/vmstat, not sure about the windows - try process >> explorer from sysinternals. > > Users report hangup later, I cannot monitor this at hangup time. > No idea what / how to monitor more. Anyway I added lines > > log_lock_waits = on > log_temp_files = 2000 > log_min_duration_statement = 10000 > log_line_prefix='%t %u %d ' > log_min_error_statement = warning add %r also to your log line prefix. > > to end of postgresql.conf and restarted server. > >>> 10-25 users manually entering sales order, e.q. low server usage >> >> So they're entering the data directly into the database? > > Try are using windows application from RDP (other computer in LAN) which > generates insert, update, delete commands to server for every entered order. What happens when an RDP session times out or is disconnected without closing an application? I.e. if a user just closes the remote desktop session without closing the app first, the app is killed, right? > >> Are you sure >> there's not something wrong in the application (e.g. a loop that takes a >> lot of time in some cases)? > > This application is used for may other sites without hangup. > I havent written indentionally such loops, statements cannot take so much > time. How many others are running over RDP? > >>> Server is installed using setup program from postgresql.org directed >>> site, >>> postresql.conf file is not modified after installation. >> >> Not sure what setup program you mean (there's an install from >> EnterpriseDB), but the default config values are usually too low (e.g. the >> default shared buffers is 24MB IIRC, but in your case 512MB would be >> probably better). Not sure if that's the problem, though. > > PostgreSql was installed from EnterpriceDB, postgresql.org site contains > link into it. > > I ran now EnterpiseDb Tuning wizard and optimized postgresql.conf using it > for mixed app server and restarted service. > > postgresql.conf now contains > > # NOTE: This has been modified by EnterpriseDB's Tuning Wizard on 2011/07/13 > 14:10:42 > # Original Value for "shared_buffers" was "32MB" > shared_buffers = 69608 # min 128kB > Hope this provides a starting point that gets you somewhere. Best wishes, Chris Travers