Thread: psql query gets stuck indefinitely

psql query gets stuck indefinitely

From

tamanna madaan

Date:

28 November 2011, 05:31:05

Hi All

I have Postgres installed in a cluster setup. My system has a script that executes the query below on a remote system in the cluster.

psql -t -q -Uslon -h<hostip> -d<dbname> -c"select 1;"

But somehow this query got stuck. It didn't return even after the remote system (on which this query was supposed to execute) was rebooted. What could be the reason?

Re: psql query gets stuck indefinitely

From

Craig Ringer

Date:

28 November 2011, 22:22:39

On 11/28/2011 05:30 PM, tamanna madaan wrote:

> Hi All
>
> I have Postgres installed in a cluster setup. My system has a script that executes the query below on a remote system in the cluster.
>
> psql -t -q -Uslon -h<hostip> -d<dbname> -c"select 1;"
>
> But somehow this query got stuck. It didn't return even after the remote system (on which this query was supposed to execute) was rebooted. What could be the reason?

The issue will most likely be related to the network or to the client-side host. Perhaps the client machine changed IP addresses (maybe as part of a switch from WiFi to wired or similar)?

Check the man page for psql in 9.1; I think client-side keepalive support got committed for 9.1. If it didn't, you can always set it globally for all TCP/IP connections on your system. See e.g. http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html .
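
As a rough sketch of what client-side keepalive can look like: libpq in PostgreSQL 9.0 and later accepts the connection parameters keepalives, keepalives_idle, keepalives_interval and keepalives_count (the 9.0 release notes quoted later in this thread mention them), and psql accepts a conninfo string in place of a database name. The helper name build_conninfo, the host 192.0.2.10, the database name mydb and the timing values are made up for illustration; only the parameter names come from libpq.

```shell
#!/bin/sh
# Build a libpq conninfo string that asks the client socket to use TCP keepalive.
# Hypothetical helper; arguments: host, database, user.
build_conninfo() {
    host=$1
    db=$2
    user=$3
    # connect_timeout=10      give up if the connection cannot be opened in 10 s
    # keepalives=1            enable client-side keepalive (the libpq default)
    # keepalives_idle=60      seconds of idle time before the first probe
    # keepalives_interval=10  seconds between unanswered probes
    # keepalives_count=5      unanswered probes before the connection is dropped
    printf 'host=%s dbname=%s user=%s connect_timeout=10 keepalives=1 keepalives_idle=60 keepalives_interval=10 keepalives_count=5' \
        "$host" "$db" "$user"
}

# Print the string. It would be passed to a 9.0+ psql like this (not run here):
#   psql -t -q "$(build_conninfo 192.0.2.10 mydb slon)" -c "select 1;"
build_conninfo 192.0.2.10 mydb slon
echo
```

With these illustrative values an unreachable peer would be noticed after roughly 60 + 5*10 = 110 seconds of silence instead of after the kernel defaults. The settings only apply to TCP connections, not to Unix-domain sockets.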

--
Craig Ringer

Re: psql query gets stuck indefinitely

From

Craig Ringer

Date:

28 November 2011, 22:29:12

On 11/28/2011 05:30 PM, tamanna madaan wrote:
> Hi All
> I have Postgres installed in a cluster setup. My system has a script
> that executes the query below on a remote system in the cluster.
> psql -t -q -Uslon -h<hostip> -d<dbname> -c"select 1;"
> But somehow this query got stuck. It didn't return even after the remote
> system (on which this query was supposed to execute) was rebooted. What
> could be the reason?

I realised just after sending my last message:

You should use ps to find out what exactly psql is doing and which
system call it's blocked in inside the kernel (if it's waiting on a
syscall). As you didn't mention your OS I'll assume you're on Linux,
where you'd use:

   ps -C psql -o wchan:80=

or

   ps -p 1234 -o wchan:80=

... where "1234" is the pid of the stuck psql process. In a psql waiting
for command line input I see it blocked in the kernel routine
"n_tty_read" for example.

If you really want to know what it's doing you can also attach gdb and
get a backtrace to see what code it's paused in inside psql:

gdb -q -p 1234 <<__END__
bt
q
__END__

If you get a message about "missing debuginfos", lots of lines reading
"no debugging symbols found" or lots of lines ending in "?? ()" then you
need to install debug symbols. How to do that depends on your OS/distro
so I won't go into that; it's documented on the PostgreSQL wiki under
"how to get a stack trace" but you probably won't want to bother if this
is just for curiosity's sake.

You're looking for output that looks like:

#1  0x000000369d22a131 in rl_getc () from /lib64/libreadline.so.6
#2  0x000000369d22a8e9 in rl_read_key () from /lib64/libreadline.so.6
#3  0x000000369d215b11 in readline_internal_char () from /lib64/libreadline.so.6
#4  0x000000369d216065 in readline () from /lib64/libreadline.so.6

... etc ...

--
Craig Ringer

Re: psql query gets stuck indefinitely

From

tamanna madaan

Date:

28 November 2011, 23:22:26

Hi Craig

Thanks for your reply. Unfortunately I don't have that process running right now; I have already killed it. But I have seen this problem sometimes on my setup. It generally happens when the remote system is running slowly for some reason (high CPU utilization etc.). Whatever the reason, I would expect the query to return with some error if the system it is running on is rebooted. But it doesn't return and remains stuck. Moreover, the same query sometimes hangs even when it is run against a local Postgres database, so I don't think network issues play any role in that. Please help.

Thanks....

Regards
Tamanna

Re: psql query gets stuck indefinitely

From

Craig Ringer

Date:

29 November 2011, 02:26:05

On 29/11/11 11:21, tamanna madaan wrote:

> Hi Craig
>
> Thanks for your reply. Unfortunately I don't have that process running right now; I have already killed it. But I have seen this problem sometimes on my setup. It generally happens when the remote system is running slowly for some reason (high CPU utilization etc.). Whatever the reason, I would expect the query to return with some error if the system it is running on is rebooted. But it doesn't return and remains stuck. Moreover, the same query sometimes hangs even when it is run against a local Postgres database, so I don't think network issues play any role in that. Please help.

Well, it *really* shouldn't hang locally.

To help you further I'll need you to collect the information on the stuck process next time you encounter one and post that as a reply. Maybe with a bit more info we can see what might be going on.

--
Craig Ringer

Re: psql query gets stuck indefinitely

From

tamanna madaan

Date:

29 November 2011, 07:27:09

Well, one question: is TCP keepalive enabled by default in postgres-8.1.2?

I am using Postgres on Linux.

Re: psql query gets stuck indefinitely

From

tamanna madaan

Date:

01 December 2011, 07:58:07

Hi Craig

I am able to reproduce the issue now. I have postgres-8.1.2 installed in a cluster setup.

I started the query below from one system, say A, against system B in the cluster:

psql -U<dbname> -h<ip of system B> -c "select sleep(300);"

While this command was running, system B was stopped abruptly by pulling out its power cable. This caused the above query on system A to hang. It is still showing in the 'ps -eaf' output after one day. I think the TCP keepalive mechanism, which has been set at system level, should have closed this connection, but it didn't. The following keepalive values are set on system A:

net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200

Why is the system-level keepalive not working in this case? I learnt from the link you provided that programs must request keepalive control for their sockets using the setsockopt interface. I wonder whether postgres 8.1.2 supports or requests system-level keepalive control. If not, which release of Postgres supports that?
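
One way to check from the outside whether a given connection actually has keepalive armed is to look at the socket's kernel timer. On Linux, `ss -o` (from iproute2) prints a `timer:(keepalive,...)` field for sockets whose owning program set SO_KEEPALIVE. The sketch below only filters such output; the function name keepalive_sockets and the sample lines are illustrative and were not captured from the systems in this thread.

```shell
#!/bin/sh
# Read `ss -tno` style lines on stdin and print only those whose socket
# has a keepalive timer armed, i.e. the owning program requested SO_KEEPALIVE.
# Hypothetical helper name.
keepalive_sockets() {
    grep -F 'timer:(keepalive'
}

# Intended use on the client host (system A), not run here:
#   ss -tno '( dport = :5432 )' | keepalive_sockets
# An empty result for the psql connection would suggest that keepalive
# was never requested on that socket.

# Illustrative input only: one line with a keepalive timer, one without.
printf '%s\n' \
    'ESTAB 0 0 192.0.2.1:40112 192.0.2.2:5432 timer:(keepalive,119min,0)' \
    'ESTAB 0 0 192.0.2.1:40113 192.0.2.2:5432' | keepalive_sockets
```

If the stuck psql connection shows no keepalive timer, the system-wide net.ipv4.tcp_keepalive_* values never come into play for it, whatever they are set to.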

Thanks...

Tamanna

Re: psql query gets stuck indefinitely

From

"Tomas Vondra"

Date:

01 December 2011, 09:58:22

On 1 December 2011, 12:57, tamanna madaan wrote:
> Hi Craig
> I am able to reproduce the issue now. I have postgres-8.1.2 installed in
> a cluster setup.

Well, the first thing you should do is to upgrade, at least to the last
8.1 minor version, which is 8.1.22. It may very well be an already fixed
bug (haven't checked). BTW the 8.1 branch has not been supported for a long
time, so upgrade to a more recent version if possible.

Second - what OS are you using, what version? The keep-alive needs support
at OS level, and if the OS is upgraded as frequently as the database (i.e.
not at all), this might be already fixed.

And finally - what do you mean by 'cluster setup'?

Tomas

Re: psql query gets stuck indefinitely

From

tamanna madaan

Date:

05 December 2011, 03:15:38

Hi Tomas

I tried it on a system running postgres-8.4.0, and the behavior is the same.

By cluster I mean a group of machines that all have Postgres installed. The same database is created on all the machines. One of them works as the master DB, on which operations (insert/delete/update) are performed; the others work as slave DBs, to which data is replicated from the master by Slony. In my cluster setup there are only two machines (A and B), one holding the master DB and the other the slave. I execute the query below from system A against system B:

psql -U<db name> -h<host ip of B> -c "select sleep(300);"

This query can be seen running on system B in the `ps -eaf | grep postgres` output.

Now, while this query is running, execute the command below on system A, which will block any packet coming to this machine:

iptables -I INPUT -i eth0 -j DROP

After 5 minutes (the sleep period), the query will finish on system B. But it can still be seen running on system A. This is presumably because the message that the query has finished has not been received by system A.

Still, I would assume that after (tcp_keepalive_time + tcp_keepalive_probes*tcp_keepalive_intvl) the psql query should return on system A as well. But the query doesn't return until it is killed manually. What could be the reason for that?
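
For reference, the worst-case detection time implied by that formula can be worked out from the sysctl values quoted earlier in this thread (7200, 9 and 75). A minimal sketch; the function name keepalive_timeout is illustrative:

```shell
#!/bin/sh
# Worst-case seconds before the kernel declares an idle connection dead:
#   tcp_keepalive_time + tcp_keepalive_probes * tcp_keepalive_intvl
# Hypothetical helper; arguments: time (s), probes (count), intvl (s).
# On a live system the inputs could be read with, for example:
#   sysctl -n net.ipv4.tcp_keepalive_time
keepalive_timeout() {
    echo $(( $1 + $2 * $3 ))
}

# Values reported for system A in this thread.
total=$(keepalive_timeout 7200 9 75)
echo "$total seconds"
echo "$(( total / 3600 ))h $(( total % 3600 / 60 ))m $(( total % 60 ))s"
```

This prints 7875 seconds, i.e. 2h 11m 15s. So even on a socket that has keepalive enabled, nothing would be reported before a little over two hours at these settings, and the timer only runs at all if the program requested SO_KEEPALIVE on its socket.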

Well, I learnt the following from the Postgres release notes:

==============================================================================================

postgres 8.1

Server-side changes:

Add configuration parameters to control TCP/IP keep-alive times for idle, interval, and count (Oliver Jowett)
These values can be changed to allow more rapid detection of lost client connections.

postgres 9.0

E.8.3.9. Development Tools
E.8.3.9.1. libpq

Add TCP keepalive settings in libpq (Tollef Fog Heen, Fujii Masao, Robert Haas)
Keepalive settings were already supported on the server end of TCP connections.

==============================================================================================

Does this mean that the TCP keepalive settings provided from Postgres 8.1 onwards only work for lost connections on the server side, and won't help in the case above, which requires psql (the client) to return? And would the TCP keepalive settings in libpq (provided from Postgres 9.0 onwards) work for the case above?

The kernel version on my system is 2.6.27.7-9-default and Postgres is 8.4.0. The keepalive settings are as below:

postgresql.conf:

#tcp_keepalives_idle = 0         # TCP_KEEPIDLE, in seconds;
                                 # 0 selects the system default
#tcp_keepalives_interval = 0     # TCP_KEEPINTVL, in seconds;
                                 # 0 selects the system default
#tcp_keepalives_count = 0        # TCP_KEEPCNT;
                                 # 0 selects the system default

System-level settings:

net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_intvl = 75

Regards

Tamanna

Re: psql query gets stuck indefinitely

From

tamanna madaan

Date:

07 December 2011, 00:35:13

Hi All

Please help me.

Thanks...

Tamanna
