Thread: Server stops responding randomly for 5 minutes

Server stops responding randomly for 5 minutes

From
"Andrus"
Date:
Sometimes per week server stops randomly responding for approx 5 minutes.
User  should wait for 5 minutes before server responds.
Other users can work normally at same time.

Monday this happens at  12:16
I havent noticed anythis special in PostgreSql and windows logs at this
time.

How to fix or find reason for this ?

Andrus.

Config:

Win 2008 server  2 GB  RAM
PostgreSQL 9.0.4, compiled by Visual C++ build 1500, 64-bit
10-25 users manually entering sales order, e.q. low server usage

Server is installed using setup program from postgresql.org directed site,
postresql.conf file is not modified after installation.

Database size: 2165 MB

Biggest files in decreasing order of size:

      1  firma1.bilkaib                                880 MB     table
      2  public.attachme                               454 MB     table
      3  firma1.rid                                    401 MB     table
      4  firma1.omrid                                  88 MB      table
      5  firma1.omdok                                  79 MB      table
      6  firma1.bilkaib_cr_idx                         75 MB      index
      7  firma1.bilkaib_cr_pattern_idx                 74 MB      index
      8  firma1.bilkaib_db_pattern_idx                 74 MB      index
      9  firma1.bilkaib_db_idx                         74 MB      index
     10  firma1.klient                                 69 MB      table
     11  firma1.dok                                    64 MB      table
     12  firma1.bilkaib_kuupaev_idx                    60 MB      index
     13  firma1.bilkaib_dokumnr_idx                    53 MB      index
     14  firma1.bilkaib_pkey                           43 MB      index
     15  firma1.rid_rtellimus_idx                      38 MB      index
     16  firma1.rid_toode_idx                          25 MB      index
     17  firma1.rid_toode_pattern_idx                  24 MB      index
     18  public.strings                                23 MB      table
     19  firma1.rid_inpdokumnr_idx                     19 MB      index
     20  firma1.toode                                  17 MB      table
     21  firma1.rid_dokumnr_idx                        16 MB      index
     22  firma1.rid_pkey                               16 MB      index
     23  firma1.summav                                 15 MB      table
     24  public.report                                 13 MB      table
     25  public.desktop                                9784 kB    table
     26  public.mailbox                                7128 kB    table
     27  public.localfil                               5584 kB    table
     28  pg_toast.pg_toast_36145_index                 5392 kB    index
...


Re: Server stops responding randomly for 5 minutes

From
Tony Wang
Date:
I think logs may help. Have you checked that?

2011/7/13 Andrus <kobruleht2@hot.ee>
Sometimes per week server stops randomly responding for approx 5 minutes.
User  should wait for 5 minutes before server responds.
Other users can work normally at same time.

Monday this happens at  12:16
I havent noticed anythis special in PostgreSql and windows logs at this time.

How to fix or find reason for this ?

Andrus.

Config:

Win 2008 server  2 GB  RAM
PostgreSQL 9.0.4, compiled by Visual C++ build 1500, 64-bit
10-25 users manually entering sales order, e.q. low server usage

Server is installed using setup program from postgresql.org directed site,
postresql.conf file is not modified after installation.

Database size: 2165 MB

Biggest files in decreasing order of size:

    1  firma1.bilkaib                                880 MB     table
    2  public.attachme                               454 MB     table
    3  firma1.rid                                    401 MB     table
    4  firma1.omrid                                  88 MB      table
    5  firma1.omdok                                  79 MB      table
    6  firma1.bilkaib_cr_idx                         75 MB      index
    7  firma1.bilkaib_cr_pattern_idx                 74 MB      index
    8  firma1.bilkaib_db_pattern_idx                 74 MB      index
    9  firma1.bilkaib_db_idx                         74 MB      index
   10  firma1.klient                                 69 MB      table
   11  firma1.dok                                    64 MB      table
   12  firma1.bilkaib_kuupaev_idx                    60 MB      index
   13  firma1.bilkaib_dokumnr_idx                    53 MB      index
   14  firma1.bilkaib_pkey                           43 MB      index
   15  firma1.rid_rtellimus_idx                      38 MB      index
   16  firma1.rid_toode_idx                          25 MB      index
   17  firma1.rid_toode_pattern_idx                  24 MB      index
   18  public.strings                                23 MB      table
   19  firma1.rid_inpdokumnr_idx                     19 MB      index
   20  firma1.toode                                  17 MB      table
   21  firma1.rid_dokumnr_idx                        16 MB      index
   22  firma1.rid_pkey                               16 MB      index
   23  firma1.summav                                 15 MB      table
   24  public.report                                 13 MB      table
   25  public.desktop                                9784 kB    table
   26  public.mailbox                                7128 kB    table
   27  public.localfil                               5584 kB    table
   28  pg_toast.pg_toast_36145_index                 5392 kB    index
...


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: Server stops responding randomly for 5 minutes

From
"Tomáš Vondra"
Date:
> Sometimes per week server stops randomly responding for approx 5 minutes.
> User  should wait for 5 minutes before server responds.
> Other users can work normally at same time.

So does the whole machine just stop responding, or just the postgresql?
Are those other users using postgresql or some other services?

> Monday this happens at  12:16
> I havent noticed anythis special in PostgreSql and windows logs at this
> time.

Have you done some basic monitoring? This typically happens when the
machine does a lot of I/O (swapping, checkpoints, ...) - not sure how this
is logged.

> How to fix or find reason for this ?

First, you have to determine what's wrong. Set up some basic monitoring,
on Linux I'd use iostat/vmstat, not sure about the windows - try process
explorer from sysinternals.

> 10-25 users manually entering sales order, e.q. low server usage

So they're entering the data directly into the database? Are you sure
there's not something wrong in the application (e.g. a loop that takes a
lot of time in some cases)?

> Server is installed using setup program from postgresql.org directed site,
> postresql.conf file is not modified after installation.

Not sure what setup program you mean (there's an install from
EnterpriseDB), but the default config values are usually too low (e.g. the
default shared buffers is 24MB IIRC, but in your case 512MB would be
probably better). Not sure if that's the problem, though.

Tomas


Re: Server stops responding randomly for 5 minutes

From
Peter Geoghegan
Date:
2011/7/13 Andrus <kobruleht2@hot.ee>:
> Sometimes per week server stops randomly responding for approx 5 minutes.
> User  should wait for 5 minutes before server responds.
> Other users can work normally at same time.

Sounds very much like a locking issue. Are you doing something like
storing a frequently updated system-wide setting in a table with a
single row?

http://wiki.postgresql.org/wiki/Lock_Monitoring

When the problem happens again, report back what the top query you see
on that wiki page shows.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Re: Server stops responding randomly for 5 minutes

From
"Andrus"
Date:
Tomáš and Tony,

thank you.

> Have you done some basic monitoring? This typically happens when the
> machine does a lot of I/O (swapping, checkpoints, ...) - not sure how this
> is logged.

This is dedicated server, used only for PostgreSql.

I filtered windows event logs near this time (12:16) . The is error

The OpsMgr Connector could not connect to OPSMAN.yyyyy.xxx:5723.  The error
code is 11004L(The requested name is valid, but no data of the requested
type was found.).  Please verify there is network connectivity, the server
is running and has registered it's listening port, and there are no
firewalls blocking traffic to the destination.

and OpsMgr related warnings. I do'nt know what is OpsMgr but expect that
this is not realted to this issue.

postgres log at this time shows

2011-07-11 11:45:27 EEST LOG:  could not receive data from client: No
connection could be made because the target machine actively refused it.

2011-07-11 11:45:27 EEST LOG:  unexpected EOF on client connection
2011-07-11 12:18:35 EEST LOG:  could not receive data from client: No
connection could be made because the target machine actively refused it.


2011-07-11 12:18:35 EEST LOG:  unexpected EOF on client connection
2011-07-11 12:18:46 EEST LOG:  could not receive data from client: No
connection could be made because the target machine actively refused it.


2011-07-11 12:18:46 EEST LOG:  unexpected EOF on client connection
2011-07-11 12:20:55 EEST LOG:  could not receive data from client: No
connection could be made because the target machine actively refused it.

those messages appear all the time so I expect that they are also not
related this issue.

> First, you have to determine what's wrong. Set up some basic monitoring,
> on Linux I'd use iostat/vmstat, not sure about the windows - try process
> explorer from sysinternals.

Users report hangup later, I cannot monitor this at hangup time.
No idea what / how to monitor more. Anyway I added lines

log_lock_waits = on
log_temp_files = 2000
log_min_duration_statement = 10000
log_line_prefix='%t %u %d '
log_min_error_statement = warning

to end of postgresql.conf and restarted server.

>> 10-25 users manually entering sales order, e.q. low server usage
>
> So they're entering the data directly into the database?

Try are using windows application from RDP (other computer in LAN) which
generates insert, update, delete commands to server for every entered order.

> Are you sure
> there's not something wrong in the application (e.g. a loop that takes a
> lot of time in some cases)?

This application is used for may other sites without hangup.
I havent written indentionally such loops, statements cannot take so much
time.

>> Server is installed using setup program from postgresql.org directed
>> site,
>> postresql.conf file is not modified after installation.
>
> Not sure what setup program you mean (there's an install from
> EnterpriseDB), but the default config values are usually too low (e.g. the
> default shared buffers is 24MB IIRC, but in your case 512MB would be
> probably better). Not sure if that's the problem, though.

PostgreSql was installed from EnterpriceDB, postgresql.org site contains
link into it.

I ran now EnterpiseDb Tuning wizard and optimized postgresql.conf using it
for mixed app server and restarted service.

postgresql.conf now contains

# NOTE: This has been modified by EnterpriseDB's Tuning Wizard on 2011/07/13
14:10:42
#       Original Value for "shared_buffers" was "32MB"
shared_buffers = 69608   # min 128kB

Andrus.


Re: Server stops responding randomly for 5 minutes

From
"Tomas Vondra"
Date:
On 13 Červenec 2011, 13:34, Andrus wrote:
> 2011-07-11 12:18:35 EEST LOG:  unexpected EOF on client connection
> 2011-07-11 12:18:46 EEST LOG:  could not receive data from client: No
> connection could be made because the target machine actively refused it.
>
>
> 2011-07-11 12:18:46 EEST LOG:  unexpected EOF on client connection
> 2011-07-11 12:20:55 EEST LOG:  could not receive data from client: No
> connection could be made because the target machine actively refused it.

AFAIK this happens when the connection is not closed properly (e.g. the
application exits without closing the connection etc.).

>> First, you have to determine what's wrong. Set up some basic monitoring,
>> on Linux I'd use iostat/vmstat, not sure about the windows - try process
>> explorer from sysinternals.
>
> Users report hangup later, I cannot monitor this at hangup time.
> No idea what / how to monitor more. Anyway I added lines
>
> log_lock_waits = on
> log_temp_files = 2000
> log_min_duration_statement = 10000
> log_line_prefix='%t %u %d '
> log_min_error_statement = warning

OK, but you need to monitor the system too. See for example the Process
Explorer - you can run it and then see the history when the users report
there was a problem.

> I ran now EnterpiseDb Tuning wizard and optimized postgresql.conf using it
> for mixed app server and restarted service.

> shared_buffers = 69608   # min 128kB

OK, that's 540MB - pretty close to 512MB I have recommended.

Tomas


Re: Server stops responding randomly for 5 minutes

From
"Andrus"
Date:
Peter,

>> Sometimes per week server stops randomly responding for approx 5 minutes.
>> User should wait for 5 minutes before server responds.
>> Other users can work normally at same time.
>Sounds very much like a locking issue.

Thank you.
This may be the issue.

> Are you doing something like
>storing a frequently updated system-wide setting in a table with a
>single row?
>http://wiki.postgresql.org/wiki/Lock_Monitoring

AFAIK Order entry should not update single row global parameter setting
table. Order id and numbers, order detail row ids and row order numbers  are
read from PostgreSql sequence tables using nextval. Hopefully this cannot
cause the issue.
Specific for this site there every order entry invokes lot of tiny updates
for single row foreign key fields on large rid table which are shown below.
Not sure that they cause delay.

>When the problem happens again, report back what the top query you see
> on that wiki page shows.

Issue is reported by users later, I cannot access to server at the time when
issue occurs. Users can report exact time when issue occured.
How to set additional logging for find reason for this after issue is
reported?

I optimized postgresql.conf to mized server using Tuning Wizard and  added

log_lock_waits = on
log_temp_files = 2000
log_min_duration_statement = 10000
log_line_prefix='%t %u %d '
log_min_error_statement = warning

to end of postgresql.conf

How to tune logging so that reason is logged if this happens again ??

Andrus.


Order entry updates:

update rid SET rid7obj='.', reakuupaev=?dok.kuupaev where (toode like 'KM%'
or toode like 'TT%') and
   dokumnr= <<TRANSFORM(dok.dokumnr)>>;
update rid SET rid7obj='.' where toode like 'YM%' and
dokumnr=<<TRANSFORM(dok.dokumnr)>>;
update rid SET reakuupaev=?dok.kuupaev where toode like 'TT%' and
dokumnr=<<TRANSFORM(dok.dokumnr)>>;

update rid SET rid6obj='XX-XXX' where dokumnr=<<TRANSFORM(dok.dokumnr)>> and
toode like 'TT%' ;
update rid SET rid6obj='XX-XXX' where dokumnr=<<TRANSFORM(dok.dokumnr)>> and
toode like 'K%' ;
update rid SET rid6obj='XX-XXX'   where  dokumnr=<<TRANSFORM(dok.dokumnr)>>
and toode like 'Y%' ;

update rid SET myygikood='E' where  dokumnr=<<TRANSFORM(dok.dokumnr)>> and
rid2obj like 'MLE%';
update rid SET myygikood='Y' where  dokumnr=<<TRANSFORM(dok.dokumnr)>> and
rid2obj like 'MLE%' and rid6obj like 'EU%' ;
update rid SET rid3obj='MNOEU' where  dokumnr=<<TRANSFORM(dok.dokumnr)>> and
rid2obj like 'MLE%';
update rid SET rid3obj='MEURO' where  dokumnr=<<TRANSFORM(dok.dokumnr)>> and
rid2obj like 'MLE%' and rid6obj like 'EU%' ;

update rid SET myygikood='E' where  dokumnr=<<TRANSFORM(dok.dokumnr)>> and
rid2obj like 'MMA%';
update rid SET rid3obj='MNOEU' where  dokumnr=<<TRANSFORM(dok.dokumnr)>> and
rid2obj like 'MMA%';
update rid SET myygikood='Y' where  dokumnr=<<TRANSFORM(dok.dokumnr)>> and
rid2obj like 'MMA%' and rid6obj like 'EU%';
update rid SET rid3obj='MEURO' where  dokumnr=<<TRANSFORM(dok.dokumnr)>> and
rid2obj like 'MMA%' and rid6obj like 'EU%';

update rid SET myygikood='Y' where  dokumnr=<<TRANSFORM(dok.dokumnr)>> and
rid2obj like 'MPU%';
update rid SET rid3obj='MNOEU' where  dokumnr=<<TRANSFORM(dok.dokumnr)>> and
rid2obj like 'MPU%';
update rid SET rid3obj='MEURO' where  dokumnr=<<TRANSFORM(dok.dokumnr)>> and
rid2obj like 'MPU%' and rid6obj like 'EU%';



Re: Server stops responding randomly for 5 minutes

From
Chris Travers
Date:
I think your log files, and the description of your issue provide at
least a decent idea of what is probably happening.  Unfortunately
without being able to observe the server when it happens, I think you
are going to have limited success tracking this down.

Proximal causes could be network errors or software bugs.

On Wed, Jul 13, 2011 at 4:34 AM, Andrus <kobruleht2@hot.ee> wrote:
> Tomáš and Tony,
>
> thank you.
>
>> Have you done some basic monitoring? This typically happens when the
>> machine does a lot of I/O (swapping, checkpoints, ...) - not sure how this
>> is logged.
>
> This is dedicated server, used only for PostgreSql.
>
> I filtered windows event logs near this time (12:16) . The is error
>
> The OpsMgr Connector could not connect to OPSMAN.yyyyy.xxx:5723.  The error
> code is 11004L(The requested name is valid, but no data of the requested
> type was found.).  Please verify there is network connectivity, the server
> is running and has registered it's listening port, and there are no
> firewalls blocking traffic to the destination.
>
> and OpsMgr related warnings. I do'nt know what is OpsMgr but expect that
> this is not realted to this issue.

It's a product called Microsoft Operations Manager or MOM for short.

The error could be related even if the software is not.  I wouldn't
toss it out entirely yet.
>
> postgres log at this time shows
>
> 2011-07-11 11:45:27 EEST LOG:  could not receive data from client: No
> connection could be made because the target machine actively refused it.
>
> 2011-07-11 11:45:27 EEST LOG:  unexpected EOF on client connection
> 2011-07-11 12:18:35 EEST LOG:  could not receive data from client: No
> connection could be made because the target machine actively refused it.

Sounds like evidence of a network problem or unexpected applicatoin termination.

What I suspect is actually happening (although this is based on sparse
evidence, it may be a useful starting point) is that a session which
is locking certain records is getting spuriously disconnected and the
session is timing out along with locks.  During this timeout period,
the transaction is not committed or rolled back and the locks are
still there.  However that's hardly enough to solve the problem.

Given that you are running the application over RDP it would be
interesting to find out whether these happen when the RDP session is
closed.  Maybe the application is killed and has locks that aren't
released until it times out on the server?  Of course this only causes
an issue in the cases where there is a request for the same locks by
another user when the RDP session closes.


>
> 2011-07-11 12:18:35 EEST LOG:  unexpected EOF on client connection
> 2011-07-11 12:18:46 EEST LOG:  could not receive data from client: No
> connection could be made because the target machine actively refused it.
>
>
> 2011-07-11 12:18:46 EEST LOG:  unexpected EOF on client connection
> 2011-07-11 12:20:55 EEST LOG:  could not receive data from client: No
> connection could be made because the target machine actively refused it.
>
> those messages appear all the time so I expect that they are also not
> related this issue.

They could be though.  If you have a lock conflict and a connection
timeout that could be causing your problem.
>
>> First, you have to determine what's wrong. Set up some basic monitoring,
>> on Linux I'd use iostat/vmstat, not sure about the windows - try process
>> explorer from sysinternals.
>
> Users report hangup later, I cannot monitor this at hangup time.
> No idea what / how to monitor more. Anyway I added lines
>
> log_lock_waits = on
> log_temp_files = 2000
> log_min_duration_statement = 10000
> log_line_prefix='%t %u %d '
> log_min_error_statement = warning

add %r also to your log line prefix.
>
> to end of postgresql.conf and restarted server.
>
>>> 10-25 users manually entering sales order, e.q. low server usage
>>
>> So they're entering the data directly into the database?
>
> Try are using windows application from RDP (other computer in LAN) which
> generates insert, update, delete commands to server for every entered order.

What happens when an RDP session times out or is disconnected without
closing an application?

I.e. if a user just closes the remote desktop session without closing
the app first, the app is killed, right?

>
>> Are you sure
>> there's not something wrong in the application (e.g. a loop that takes a
>> lot of time in some cases)?
>
> This application is used for may other sites without hangup.
> I havent written indentionally such loops, statements cannot take so much
> time.

How many others are running over RDP?
>
>>> Server is installed using setup program from postgresql.org directed
>>> site,
>>> postresql.conf file is not modified after installation.
>>
>> Not sure what setup program you mean (there's an install from
>> EnterpriseDB), but the default config values are usually too low (e.g. the
>> default shared buffers is 24MB IIRC, but in your case 512MB would be
>> probably better). Not sure if that's the problem, though.
>
> PostgreSql was installed from EnterpriceDB, postgresql.org site contains
> link into it.
>
> I ran now EnterpiseDb Tuning wizard and optimized postgresql.conf using it
> for mixed app server and restarted service.
>
> postgresql.conf now contains
>
> # NOTE: This has been modified by EnterpriseDB's Tuning Wizard on 2011/07/13
> 14:10:42
> #       Original Value for "shared_buffers" was "32MB"
> shared_buffers = 69608   # min 128kB
>
Hope this provides a starting point that gets you somewhere.

Best wishes,
Chris Travers