Thread: AWS and postgres issues
Howdy,

I'm having a couple of problems that I believe are related to AWS and I'm wondering if anyone's seen them / overcome them.

Brief background: I'm running PG 9.2.4 in a VPC on Amazon Linux. I'm also attempting to use PgPool for load balancing/failover.

The overall problem is that it seems like some Postgres commands / operations get truncated at a network/packet level.

For example, when I try to run (from PgPool server => Postgres server):
ssh -vvv -T postgres@10.0.1.30 "/usr/pgsql-9.2/bin/pg_ctl -D /db/pg -m fast restart"

The command completes successfully on the Postgres server, and the process goes away. However, on the PgPool server that process never dies, it just hangs.

PgPool box:
ps -ef|grep -i ssh|grep -v sshd|grep -v grep
pgpool 27196 26241 0 19:57 pts/0 00:00:00 ssh -vvv postgres@10.0.1.30 bash -c '/usr/pgsql-9.2/bin/pg_ctl -D /db/pg-m fast restart'

Postgres box:
ps -ef|grep -i pg_ctl
postgres 2376 26436 0 19:58 pts/1 00:00:00 grep -i pg_ctl

Other non-postgres commands run over ssh return as expected.

I don't know if this is helpful, but here's an strace of the process:
setsockopt(3, SOL_IP, IP_TOS, [8], 4) = 0
time(NULL) = 1365450845
select(7, [3], [3], NULL, NULL) = 1 (out [3])
time(NULL) = 1365450845
write(3, "2O\235qZ\333\2160\333\371\372\374\215\204\337X)\215\321J\5\343\240(\325\316\224W\370(7+"..., 176) = 176
time(NULL) = 1365450845
select(7, [3], [], NULL, NULL) = 1 (in [3])
time(NULL) = 1365450845
read(3, "\303\223BDr5\376I\304Io\4\25\33\6\25>L\214\f_~J\342gc#w\365\5\320\242"..., 8192) = 80
time(NULL) = 1365450845
select(7, [3 4], [], NULL, NULL) = 1 (in [3])
time(NULL) = 1365450845
read(3, "\352\366A\360c\315\t\310\361\24z\217H\t\314\342\361\322\335}l6\302)\223\343\361\27&{\234H"..., 8192) = 128
time(NULL) = 1365450845
select(7, [3 4], [5], NULL, NULL) = 1 (out [5])
time(NULL) = 1365450845
write(5, "waiting for server to shut down."..., 35waiting for server to shut down....) = 35
time(NULL) = 1365450845
select(7, [3 4], [], NULL, NULL) = 1 (in [3])
time(NULL) = 1365450846
read(3, "c\264\317\303Q\222\214b\323>\300\354\306j\36\31+\342\360\325Y8\345\322\211?<\0210n\253\211"..., 8192) = 64
time(NULL) = 1365450846
select(7, [3 4], [5], NULL, NULL) = 1 (out [5])
time(NULL) = 1365450846
write(5, " done\nserver stopped\n", 21 done
server stopped
) = 21
time(NULL) = 1365450846
select(7, [3 4], [], NULL, NULL) = 1 (in [3])
time(NULL) = 1365450846
read(3, "\253\210\306\251\343lF^6\32|v\374fe\23\32\3ylZ\325[\205\344,x@\4\201\213\351"..., 8192) = 64
time(NULL) = 1365450846
select(7, [3 4], [5], NULL, NULL) = 1 (out [5])
time(NULL) = 1365450846
write(5, "server starting\n", 16server starting
) = 16
time(NULL) = 1365450846
select(7, [3 4], [], NULL, NULL) = 1 (in [3])
time(NULL) = 1365450846
read(3, "\373 \347w\354%\314<\6\215\314\207\7\202\274q\341:\270t\366\375\242{9\207:\222\374jy\373"..., 8192) = 128
close(4) = 0
time(NULL) = 1365450846
select(7, [3], [], NULL, NULL

And the same thing run with ssh -vvvv:
debug1: channel 0: new [client-session]
debug3: ssh_session2_open: channel_new: 0
debug2: channel 0: send open
debug1: Requesting no-more-sessions@openssh.com
debug1: Entering interactive session.
debug2: callback start
debug2: fd 3 setting TCP_NODELAY
debug3: packet_set_tos: set IP_TOS 0x08
debug2: client_session2_setup: id 0
debug1: Sending environment.
debug3: Ignored env HOSTNAME
debug3: Ignored env SHELL
debug3: Ignored env TERM
debug3: Ignored env HISTSIZE
debug3: Ignored env EC2_AMITOOL_HOME
debug3: Ignored env OLDPWD
debug3: Ignored env USER
debug3: Ignored env LS_COLORS
debug3: Ignored env EC2_HOME
debug3: Ignored env MAIL
debug3: Ignored env PATH
debug3: Ignored env PWD
debug3: Ignored env JAVA_HOME
debug1: Sending env LANG = en_US.UTF-8
debug2: channel 0: request env confirm 0
debug3: Ignored env AWS_CLOUDWATCH_HOME
debug3: Ignored env AWS_IAM_HOME
debug3: Ignored env HISTCONTROL
debug3: Ignored env SHLVL
debug3: Ignored env HOME
debug3: Ignored env AWS_PATH
debug3: Ignored env AWS_AUTO_SCALING_HOME
debug3: Ignored env LOGNAME
debug3: Ignored env AWS_ELB_HOME
debug3: Ignored env LESSOPEN
debug3: Ignored env AWS_RDS_HOME
debug3: Ignored env G_BROKEN_FILENAMES
debug3: Ignored env _
debug1: Sending command: bash -c '/usr/pgsql-9.2/bin/pg_ctl -D /db/pg -m fast restart'
debug2: channel 0: request exec confirm 1
debug2: callback done
debug2: channel 0: open confirm rwindow 0 rmax 32768
debug2: channel 0: rcvd adjust 2097152
debug2: channel_input_status_confirm: type 99 id 0
debug2: exec request accepted on channel 0
waiting for server to shut down.... done
server stopped
server starting
debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
debug1: client_input_channel_req: channel 0 rtype eow@openssh.com reply 0
debug2: channel 0: rcvd eow
debug2: channel 0: close_read
debug2: channel 0: input open -> closed

Any help would be great.

Thanks,
Dave
What version of pgpool are you using?
Are there other commands you have a problem with? I would suspect that the restart is causing the postgres server to go away, pgpool decides to disconnect, and then it has to be manually added back to the cluster. Unless of course you've got automatic failback setup, but even then I would expect that command to do weird things when issued through middleware like pgpool, regardless of what sort of infrastructure you are running on.
QH
On Mon, Apr 08, 2013 at 02:14:14PM -0600, Quentin Hartman wrote:
- What version of pgpool are you using?
-
- Are there other commands you have a problem with? I would suspect that the
- restart is causing the postgres server to go away, pgpool decides to
- disconnect, and then it has to be manually added back to the cluster.
- Unless of course you've got automatic failback setup, but even then I would
- expect that command to do weird things when issued through middleware like
- pgpool, regardless of what sort of infrastructure you are running on.
-
- QH

This is actually from the command line, PgPool isn't involved at all. I just
mentioned it to give some context.
On Mon, Apr 08, 2013 at 02:09:42PM -0700, David Kerr wrote:
- This is actually from the command line, PgPool isn't involved at all. I just
- mentioned it to give some context.

I had a brief conversation with Quentin offline which indicated that I wasn't
being nearly clear or direct enough.

I believe that this problem is specific to AWS+Postgres (and possibly specific to
VPC+Amazon Linux). Non-postgres commands run fine, and even psql works fine.
So far just pg_ctl fails.

I'm running the command directly from an interactive shell, so this is as basic as it gets.

The command runs correctly on the remote server, it just never exits the ssh connection.
Specifically, I never get "debug2: channel 0: rcvd close", as if the packet gets
dropped every time.

I don't believe it's a network hiccup because I can reproduce it every time.

It's likely something with Amazon's infrastructure that's eating it, but whatever
it is, it seems to specifically not like pg_ctl.

Here is what I see happen:
[pgpool@ccpgp05 ~]$ ssh -vvv postgres@10.0.1.30 '/usr/pgsql-9.2/bin/pg_ctl -D /db/pg -m fast restart'
OpenSSH_6.1p1, OpenSSL 1.0.1e-fips 11 Feb 2013
[..snip..]
debug1: Sending command: /usr/pgsql-9.2/bin/pg_ctl -D /db/pg -m fast restart
[..snip..]
waiting for server to shut down.... done
server stopped
server starting
[..snip..]
debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
debug1: client_input_channel_req: channel 0 rtype eow@openssh.com reply 0
debug2: channel 0: rcvd eow
debug2: channel 0: close_read
debug2: channel 0: input open -> closed
^Cdebug1: channel 0: free: client-session, nchannels 1 # <---- This is where I ^C it
debug3: channel 0: status: The following connections are open:
  #0 client-session (t4 r0 i3/0 o0/0 fd -1/5 cc -1)

Notice that "input open -> closed" is where it basically hangs.

Now look at:
[pgpool@ccpgp05 ~]$ ssh -vvv postgres@10.0.1.30 'ls -ltr'
OpenSSH_6.1p1, OpenSSL 1.0.1e-fips 11 Feb 2013
[..snip..]
debug1: Sending command: ls -ltr
debug2: channel 0: output open -> drain
debug1: channel 0: forcing write
total 8
drwx------ 4 postgres postgres 4096 Apr 4 22:50 9.2
drwx------ 2 postgres postgres 4096 Apr 5 22:58 bin
[..snip..]
debug2: channel 0: input open -> closed
debug2: channel 0: rcvd close
debug3: channel 0: will not send data after close
debug2: channel 0: almost dead
debug2: channel 0: gc: notify user
debug2: channel 0: gc: user detached
debug2: channel 0: send close
debug2: channel 0: is dead
debug2: channel 0: garbage collecting
debug1: channel 0: free: client-session, nchannels 1
debug3: channel 0: status: The following connections are open:
  #0 client-session (t4 r0 i3/0 o3/0 fd -1/-1 cc -1)

Transferred: sent 2456, received 2448 bytes, in 0.0 seconds
Bytes per second: sent 105666.4, received 105322.3
debug1: Exit status 0

Thanks.
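The comparison above hinges on whether "rcvd close" ever shows up in the client's debug output. As a rough sketch, that check can be scripted against saved logs. The log excerpts below are shortened copies of the lines quoted in the post; the function name saw_close and the temporary file names are hypothetical, and capturing a real log (for example by redirecting the stderr of ssh -vvv to a file) is left to the reader.

```shell
#!/bin/sh
# Report whether a saved `ssh -vvv` log shows the server closing channel 0.
# saw_close is a hypothetical helper name, not part of ssh or PostgreSQL.
saw_close() {
    if grep -q 'channel 0: rcvd close' "$1"; then
        echo "close received"
    else
        echo "no close received"
    fi
}

tmpdir=$(mktemp -d)

# Excerpt of the hanging pg_ctl session, copied from the post above.
cat > "$tmpdir/pg_ctl.log" <<'EOF'
debug2: channel 0: rcvd eow
debug2: channel 0: close_read
debug2: channel 0: input open -> closed
EOF

# Excerpt of the `ls -ltr` session, copied from the post above.
cat > "$tmpdir/ls.log" <<'EOF'
debug2: channel 0: input open -> closed
debug2: channel 0: rcvd close
debug2: channel 0: send close
debug1: Exit status 0
EOF

pg_ctl_result=$(saw_close "$tmpdir/pg_ctl.log")
ls_result=$(saw_close "$tmpdir/ls.log")
echo "pg_ctl restart: $pg_ctl_result"
echo "ls -ltr:        $ls_result"

rm -rf "$tmpdir"
```

This only confirms the symptom (the channel close never arrives); it says nothing about why.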
On Mon, Apr 08, 2013 at 02:59:56PM -0700, David Kerr wrote:
- [..snip..]

I've verified that it's not related to the Linux flavor. I've tried it with a server on both Amazon Linux
and Ubuntu, and with a client on both Amazon Linux and my own desktop.

I can't be the only person using PG in AWS+VPC. Can someone else with a similar test bed give it a shot
and tell me if it works for them? (At least then I'd know if it's likely something I'm doing...)

Thanks
On Mon, Apr 08, 2013 at 04:24:45PM -0700, David Kerr wrote:
- [..snip..]

While debugging this with a coworker we figured out that pg_ctl was attaching to the tty, and then it clicked
that we needed to be using '-t' where I was using '-T' (or neither).

So mystery solved!
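For reference, the change reported above is a single ssh flag. Here is a minimal sketch using the host and paths from the thread. The run helper and the DRY_RUN variable are hypothetical additions so that the script prints the command by default instead of restarting a server.

```shell
#!/bin/sh
# Host and pg_ctl invocation are taken from the thread; adjust for your own setup.
HOST="postgres@10.0.1.30"
PG_CTL="/usr/pgsql-9.2/bin/pg_ctl -D /db/pg -m fast restart"

# Hypothetical safety switch: DRY_RUN=1 (the default) only prints the command,
# DRY_RUN=0 actually executes it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Variant that hung in the thread: -T (no pseudo-tty), or no flag at all.
#   run ssh -T "$HOST" "$PG_CTL"

# Variant reported to work in the thread: -t forces pseudo-tty allocation.
run ssh -t "$HOST" "$PG_CTL"
```

The thread only tested this from an interactive shell. When ssh is started without a local terminal (as it might be from a PgPool script), the ssh manual describes repeating the option (-tt) to force allocation; whether that is needed here is not covered in the thread.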
> While debugging this with a coworker we figured out that pg_ctl was attaching to the tty and then it clicked
> that we needed to be using '-t' where I was using -T or (neither).

Are you sure? I checked the pg_ctl source code and could not find any
place attaching to the tty.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
2013/4/9 Tatsuo Ishii <ishii@postgresql.org>:
>> While debugging this with a coworker we figured out that pg_ctl was attaching to the tty and then it clicked
>> that we needed to be using '-t' where I was using -T or (neither).
>
> Are you sure? I checked the pg_ctl source code and could not find any
> place attaching to the tty.

I think he means the ssh options -t and -T

Regards

Ian Barwick
> 2013/4/9 Tatsuo Ishii <ishii@postgresql.org>:
>>> While debugging this with a coworker we figured out that pg_ctl was attaching to the tty and then it clicked
>>> that we needed to be using '-t' where I was using -T or (neither).
>>
>> Are you sure? I checked the pg_ctl source code and could not find any
>> place attaching to the tty.
>
> I think he means the ssh options -t and -T

Yes, I know. As I understand it, he is saying that because pg_ctl attaches
to the tty, ssh should be executed with -t (which forces ssh to allocate a
pseudo-tty).
--
Tatsuo Ishii
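The thread leaves open why the session hung without a pseudo-tty. One possible explanation, offered here as an assumption rather than something established in the thread, is that the restarted server inherits pg_ctl's standard output, so the ssh session's output stream never reaches end-of-file. The sketch below reproduces only that general effect locally, with sleep standing in for a long-running server. It involves neither ssh nor PostgreSQL, and the function names are hypothetical.

```shell
#!/bin/sh
# Local analogue only: a reader on a pipe sees EOF when the LAST process
# holding the write end exits, not when the command that started it returns.

# Print the whole seconds it takes for a command's stdout to reach EOF.
elapsed() {
    start=$(date +%s)
    "$@" | cat >/dev/null
    end=$(date +%s)
    echo $((end - start))
}

# Returns at once, but the background child keeps the inherited stdout open.
inherits_stdout() { sleep 3 & }

# Same background child, but with its output detached from the pipe.
detached_stdout() { sleep 3 >/dev/null 2>&1 & }

held=$(elapsed inherits_stdout)
freed=$(elapsed detached_stdout)

echo "child inherits stdout:  reader waited ${held}s"
echo "child output redirected: reader waited ${freed}s"
```

If this is what happens with pg_ctl, redirecting its output on the remote side, or giving it a log file with its -l option, might be an alternative to -t. Neither was tried in the thread.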
On Apr 8, 2013, at 5:52 PM, Tatsuo Ishii <ishii@postgresql.org> wrote:
> Yes, I know. In my understanding, he is saying because pg_ctl attaches
> to the tty, and ssh should be executed with -t (force ssh to allocate
> pseudo-tty).

Yeah, and I really expected to NOT want to attach to the pseudo-tty.
I was also sort of hoping that it was dropping packets or something like that, because
then it might have been a similar problem to the one I reported here: