Thread: psql query gets stuck indefinitely
--
Tamanna Madaan | Associate Consultant | GlobalLogic Inc.
Leaders in Software R&D Services
ARGENTINA | CHILE | CHINA | GERMANY | INDIA | ISRAEL | UKRAINE | UK | USA
Office: +0-120-406-2000 x 2971
www.globallogic.com
Hi All,
I have postgres installed in a cluster setup. My system has a script which executes the query below on a remote system in the cluster:
psql -t -q -Uslon -h<hostip> -d<dbname> -c"select 1;"
But somehow this query got stuck. It didn't return even after the remote system (on which the query was supposed to execute) was rebooted. What could be the reason?
The issue will most likely be related to the network or to the client-side host. Perhaps the client machine changed IP addresses (maybe as part of a switch from WiFi to wired or similar)?
Check the man page for psql in 9.1; I think client-side keepalive support got committed for 9.1. If it didn't, you can always set it globally for all TCP/IP connections on your system. See e.g. http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html .
--
Craig Ringer
On 11/28/2011 05:30 PM, tamanna madaan wrote:
> Hi All
> I have postgres installed in cluster setup. My system has a script
> which executes the below query on remote system in cluster.
> psql -t -q -Uslon -h<hostip> -d<dbname> -c"select 1;"
> But somehow this query got stuck. It didnt return even after the remote
> system( on which this query was supposed to execute) is rebooted . What
> could be the reason ??
I realised just after sending my last message:
You should use ps to find out what exactly psql is doing and which system call it's blocked in in the kernel (if it's waiting on a syscall). As you didn't mention your OS I'll assume you're on Linux, where you'd use:
ps -C psql -o wchan:80=
or
ps -p 1234 -o wchan:80=
... where "1234" is the pid of the stuck psql process. In a psql waiting for command line input I see it blocked in the kernel routine "n_tty_read" for example.
If you really want to know what it's doing you can also attach gdb and get a backtrace to see what code it's paused in inside psql:
gdb -q -p 1234 <<__END__
bt
q
__END__
If you get a message about "missing debuginfos", lots of lines reading "no debugging symbols found" or lots of lines ending in "?? ()" then you need to install debug symbols. How to do that depends on your OS/distro so I won't go into that; it's documented on the PostgreSQL wiki under "how to get a stack trace" but you probably won't want to bother if this is just for curiosity's sake.
You're looking for output that looks like:
#1 0x000000369d22a131 in rl_getc () from /lib64/libreadline.so.6
#2 0x000000369d22a8e9 in rl_read_key () from /lib64/libreadline.so.6
#3 0x000000369d215b11 in readline_internal_char () from /lib64/libreadline.so.6
#4 0x000000369d216065 in readline () from /lib64/libreadline.so.6
... etc ...
--
Craig Ringer
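The two checks described above (kernel wait channel via ps, user-space backtrace via gdb) could be bundled into one helper so the output is easy to attach to a reply. This is only a sketch, assuming Linux with procps `ps` and `gdb` installed; the function name `collect_stuck_info` and the output file name are made up for illustration.

```shell
#!/bin/sh
# Collect the kernel wait channel and a backtrace for one stuck process.
# collect_stuck_info is a hypothetical helper name, not part of PostgreSQL.
collect_stuck_info() {
    pid=$1
    # Refuse anything that is not a plain numeric pid.
    case $pid in
        ''|*[!0-9]*)
            echo "usage: collect_stuck_info PID" >&2
            return 2
            ;;
    esac
    echo "== wchan for pid $pid =="
    ps -p "$pid" -o wchan:80=
    echo "== backtrace for pid $pid =="
    # -batch makes gdb exit after running the commands given with -ex.
    gdb -q -batch -p "$pid" -ex bt 2>&1
}

# Example, with 1234 standing in for the pid of the stuck psql:
#   collect_stuck_info 1234 > psql-stuck.txt
```

Attaching gdb usually needs the same user as the target process, or root.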
Thanks for your reply. Unfortunately I don't have that process running right now; I have already killed it. But I have seen this problem sometimes on my setup.
It generally happens when the remote system is running slowly for some reason (high CPU utilization etc.). Whatever the reason, I would expect the query to return with some error if the system it is running on is rebooted. Instead it doesn't return and remains stuck. Moreover, the same query sometimes hangs even when it is run against the local postgres database, so I don't think network issues play any role here. Please help.
Thanks,
Regards
Tamanna
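Independent of the root cause, a calling script can be protected from blocking forever by putting a deadline on the psql call. A minimal sketch, assuming `timeout` from GNU coreutils is available on the client; the name `run_with_deadline`, the 30 second limit, and the host and database placeholders are illustrative assumptions, not something taken from the setup described above.

```shell
#!/bin/sh
# Run a command but give up after a deadline, so the caller never blocks
# indefinitely. Relies on timeout(1) from GNU coreutils, which exits with
# status 124 when the deadline is hit.
run_with_deadline() {
    secs=$1
    shift
    timeout "$secs" "$@"
}

# Example (192.0.2.10 and mydb are placeholders):
#   run_with_deadline 30 psql -t -q -Uslon -h192.0.2.10 -dmydb -c "select 1;"
#   [ $? -eq 124 ] && echo "health check timed out" >&2
```

This only bounds how long the script waits; it does not explain why the connection hung.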
Well, it *really* shouldn't hang locally.
To help you further I'll need you to collect the information on the stuck process next time you encounter one and post that as a reply. Maybe with a bit more info we can see what might be going on.
--
Craig Ringer
Well, one question: is TCP keepalive enabled by default in postgres-8.1.2?
I am using postgres on the Linux platform.
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200
On 1 December 2011, 12:57, tamanna madaan wrote:
> Hi Craig
> I am able to reproduce the issue now . I have postgres-8.1.2 installed in
> cluster setup.
Well, the first thing you should do is to upgrade, at least to the last 8.1 minor version, which is 8.1.22. It may very well be an already fixed bug (haven't checked). BTW the 8.1 branch has not been supported for a long time, so upgrade to a more recent version if possible.
Second - what OS are you using, what version? The keep-alive needs support at OS level, and if the OS is upgraded as frequently as the database (i.e. not at all), this might be already fixed.
And finally - what do you mean by 'cluster setup'?
Tomas
Hi Tomas,
I tried it on a system running postgres-8.4.0, and the behavior is the same.
By cluster I mean a group of machines that all have postgres installed. The same database is created on all the machines. One of them acts as the master DB, on which operations (insert/delete/update) are performed, and the others act as slave DBs, to which slony replicates the data from the master. In my cluster setup there are only two machines (A and B), one holding the master DB and the other the slave. I execute the query below from system A against system B:
psql -U<db name> -h<host ip of B> -c "select sleep(300);"
This query can be seen running on system B in the `ps -eaf | grep postgres` output.
Now, while this query is running, execute the command below on system A, which blocks every packet coming in to this machine:
iptables -I INPUT -i eth0 -j DROP
After 5 minutes (the sleep period), the query finishes on system B, but it can still be seen running on system A. This is probably because the message saying that the query has finished never reaches system A.
Still, I would expect the psql query to return on system A as well after (tcp_keepalive_time + tcp_keepalive_probes*tcp_keepalive_intvl). But the query doesn't return until it is killed manually.
What could be the reason for that?
I learnt the following from the postgres release notes:
==============================================================================================
postgres 8.1
server side changes:
Add configuration parameters to control TCP/IP keep-alive times for idle, interval, and count (Oliver Jowett)
These values can be changed to allow more rapid detection of lost client connections.
postgres 9.0
E.8.3.9. Development Tools
E.8.3.9.1. libpq
Add TCP keepalive settings in libpq (Tollef Fog Heen, Fujii Masao, Robert Haas)
Keepalive settings were already supported on the server end of TCP connections.
==============================================================================================
Does this mean that the TCP keepalive settings provided from postgres 8.1 onwards only let the server detect lost client connections, and won't help in the case above, which requires psql (the client) to return? And would the TCP keepalive settings in libpq (provided from postgres 9.0 onwards) work for the case above?
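For reference, the libpq settings mentioned in the 9.0 release note are ordinary connection parameters (`keepalives`, `keepalives_idle`, `keepalives_interval`, `keepalives_count`), so a 9.0 or later psql can be given them in a connection string. The sketch below only builds such a string; the helper name `build_conninfo`, the timing values, and the host and database names are assumptions for illustration, and it has not been run against the setup described here.

```shell
#!/bin/sh
# Build a libpq connection string that turns on client-side TCP keepalives.
# The keepalives* keywords are libpq connection parameters (9.0 and later);
# the timing values are illustrative, not recommendations.
build_conninfo() {
    host=$1
    db=$2
    user=$3
    printf 'host=%s dbname=%s user=%s keepalives=1 keepalives_idle=60 keepalives_interval=10 keepalives_count=5\n' \
        "$host" "$db" "$user"
}

# Example (192.0.2.10 and mydb are placeholders):
#   psql "$(build_conninfo 192.0.2.10 mydb slon)" -t -q -c "select 1;"
build_conninfo 192.0.2.10 mydb slon
```

With these example values an unreachable server should be noticed after roughly 60 + 5*10 = 110 seconds of silence, rather than the much longer system default.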
The kernel version on my system is 2.6.27.7-9-default, with postgres-8.4.0. The keepalive settings are as below:
postgresql.conf:
#tcp_keepalives_idle = 0 # TCP_KEEPIDLE, in seconds;
# 0 selects the system default
#tcp_keepalives_interval = 0 # TCP_KEEPINTVL, in seconds;
# 0 selects the system default
#tcp_keepalives_count = 0 # TCP_KEEPCNT;
# 0 selects the system default
system level settings:
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_intvl = 75
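Plugging the system-level values above into the formula (tcp_keepalive_time + tcp_keepalive_probes*tcp_keepalive_intvl) gives the earliest point at which the kernel would give up on a silent peer. A small sketch of that arithmetic; the function name `keepalive_detection_seconds` is made up for this example.

```shell
#!/bin/sh
# Time in seconds before the kernel declares an idle TCP peer dead:
# idle time before the first probe + number of probes * probe interval.
keepalive_detection_seconds() {
    idle=$1
    probes=$2
    intvl=$3
    echo $(( idle + probes * intvl ))
}

# Values from the sysctl settings listed above.
total=$(keepalive_detection_seconds 7200 9 75)
echo "dead-peer detection after ${total}s (about $(( total / 60 )) minutes)"
# prints: dead-peer detection after 7875s (about 131 minutes)
```

Note that on Linux these timers apply only to sockets that have SO_KEEPALIVE enabled, so they would have no effect on a client connection that never sets that option.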
Regards
Tamanna