Thread: AWS and postgres issues

AWS and postgres issues

From

David Kerr

Date:

08 April 2013, 19:59:04

Howdy,

I'm having a couple of problems that I believe are related to AWS and I'm wondering
if anyone's seen them / overcome them.

Brief background: I'm running PG 9.2.4 in a VPC on Amazon Linux.
I'm also attempting to use PgPool for load balancing/failover.

The overall problem is that it seems like some Postgres commands / operations get truncated
at a network/packet level.

For example, when I run this (from the PgPool server to the Postgres server):
ssh -vvv -T postgres@10.0.1.30 "/usr/pgsql-9.2/bin/pg_ctl -D /db/pg -m fast restart"

The command completes successfully on the Postgres server, and the pg_ctl process goes away there;
however, on the PgPool server the ssh process never dies, it just hangs.

PgPool box:
ps -ef|grep -i ssh|grep -v sshd|grep -v grep
pgpool   27196 26241  0 19:57 pts/0    00:00:00 ssh -vvv postgres@10.0.1.30 bash -c '/usr/pgsql-9.2/bin/pg_ctl -D /db/pg -m fast restart'

Postgres box:
ps -ef|grep -i pg_ctl
postgres  2376 26436  0 19:58 pts/1    00:00:00 grep -i pg_ctl

Other non-postgres commands run over ssh return as expected.

I don't know if this is helpful, but here's an strace of the process:
setsockopt(3, SOL_IP, IP_TOS, [8], 4)   = 0
time(NULL)                              = 1365450845
select(7, [3], [3], NULL, NULL)         = 1 (out [3])
time(NULL)                              = 1365450845
write(3, "2O\235qZ\333\2160\333\371\372\374\215\204\337X)\215\321J\5\343\240(\325\316\224W\370(7+"..., 176) = 176
time(NULL)                              = 1365450845
select(7, [3], [], NULL, NULL)          = 1 (in [3])
time(NULL)                              = 1365450845
read(3, "\303\223BDr5\376I\304Io\4\25\33\6\25>L\214\f_~J\342gc#w\365\5\320\242"..., 8192) = 80
time(NULL)                              = 1365450845
select(7, [3 4], [], NULL, NULL)        = 1 (in [3])
time(NULL)                              = 1365450845
read(3, "\352\366A\360c\315\t\310\361\24z\217H\t\314\342\361\322\335}l6\302)\223\343\361\27&{\234H"..., 8192) = 128
time(NULL)                              = 1365450845
select(7, [3 4], [5], NULL, NULL)       = 1 (out [5])
time(NULL)                              = 1365450845
write(5, "waiting for server to shut down."..., 35waiting for server to shut down....) = 35
time(NULL)                              = 1365450845
select(7, [3 4], [], NULL, NULL)        = 1 (in [3])
time(NULL)                              = 1365450846
read(3, "c\264\317\303Q\222\214b\323>\300\354\306j\36\31+\342\360\325Y8\345\322\211?<\0210n\253\211"..., 8192) = 64
time(NULL)                              = 1365450846
select(7, [3 4], [5], NULL, NULL)       = 1 (out [5])
time(NULL)                              = 1365450846
write(5, " done\nserver stopped\n", 21 done
server stopped
) = 21
time(NULL)                              = 1365450846
select(7, [3 4], [], NULL, NULL)        = 1 (in [3])
time(NULL)                              = 1365450846
read(3, "\253\210\306\251\343lF^6\32|v\374fe\23\32\3ylZ\325[\205\344,x@\4\201\213\351"..., 8192) = 64
time(NULL)                              = 1365450846
select(7, [3 4], [5], NULL, NULL)       = 1 (out [5])
time(NULL)                              = 1365450846
write(5, "server starting\n", 16server starting
)       = 16
time(NULL)                              = 1365450846
select(7, [3 4], [], NULL, NULL)        = 1 (in [3])
time(NULL)                              = 1365450846
read(3, "\373 \347w\354%\314<\6\215\314\207\7\202\274q\341:\270t\366\375\242{9\207:\222\374jy\373"..., 8192) = 128
close(4)                                = 0
time(NULL)                              = 1365450846
select(7, [3], [], NULL, NULL

And here is the same thing run with ssh -vvvv:

debug1: channel 0: new [client-session]
debug3: ssh_session2_open: channel_new: 0
debug2: channel 0: send open
debug1: Requesting no-more-sessions@openssh.com
debug1: Entering interactive session.
debug2: callback start
debug2: fd 3 setting TCP_NODELAY
debug3: packet_set_tos: set IP_TOS 0x08
debug2: client_session2_setup: id 0
debug1: Sending environment.
debug3: Ignored env HOSTNAME
debug3: Ignored env SHELL
debug3: Ignored env TERM
debug3: Ignored env HISTSIZE
debug3: Ignored env EC2_AMITOOL_HOME
debug3: Ignored env OLDPWD
debug3: Ignored env USER
debug3: Ignored env LS_COLORS
debug3: Ignored env EC2_HOME
debug3: Ignored env MAIL
debug3: Ignored env PATH
debug3: Ignored env PWD
debug3: Ignored env JAVA_HOME
debug1: Sending env LANG = en_US.UTF-8
debug2: channel 0: request env confirm 0
debug3: Ignored env AWS_CLOUDWATCH_HOME
debug3: Ignored env AWS_IAM_HOME
debug3: Ignored env HISTCONTROL
debug3: Ignored env SHLVL
debug3: Ignored env HOME
debug3: Ignored env AWS_PATH
debug3: Ignored env AWS_AUTO_SCALING_HOME
debug3: Ignored env LOGNAME
debug3: Ignored env AWS_ELB_HOME
debug3: Ignored env LESSOPEN
debug3: Ignored env AWS_RDS_HOME
debug3: Ignored env G_BROKEN_FILENAMES
debug3: Ignored env _
debug1: Sending command: bash -c '/usr/pgsql-9.2/bin/pg_ctl -D /db/pg -m fast restart'
debug2: channel 0: request exec confirm 1
debug2: callback done
debug2: channel 0: open confirm rwindow 0 rmax 32768
debug2: channel 0: rcvd adjust 2097152
debug2: channel_input_status_confirm: type 99 id 0
debug2: exec request accepted on channel 0
waiting for server to shut down.... done
server stopped
server starting
debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
debug1: client_input_channel_req: channel 0 rtype eow@openssh.com reply 0
debug2: channel 0: rcvd eow
debug2: channel 0: close_read
debug2: channel 0: input open -> closed

Any help would be great. Thanks,

Dave

Re: AWS and postgres issues

From

Quentin Hartman

Date:

08 April 2013, 20:14:23

What version of pgpool are you using?

Are there other commands you have a problem with? I would suspect that the restart is causing the postgres server to go away, pgpool decides to disconnect, and then it has to be manually added back to the cluster. Unless of course you've got automatic failback set up, but even then I would expect that command to do weird things when issued through middleware like pgpool, regardless of what sort of infrastructure you are running on.

QH

Re: AWS and postgres issues

From

David Kerr

Date:

08 April 2013, 21:09:52

On Mon, Apr 08, 2013 at 02:14:14PM -0600, Quentin Hartman wrote:
- What version of pgpool are you using?
-
- Are there other commands you have a problem with? I would suspect that the
- restart is causing the postgres server to go away, pgpool decides to
- disconnect, and then it has to be manually added back to the cluster.
- Unless of course you've got automatic failback setup, but even then I would
- expect that command to do weird things when issued through middleware like
- pgpool, regardless of what sort of infrastructure you are running on.
-
- QH

This is actually from the command line, PgPool isn't involved at all. I just
mentioned it to give some context.

Re: AWS and postgres issues

From

David Kerr

Date:

08 April 2013, 22:00:09

On Mon, Apr 08, 2013 at 02:09:42PM -0700, David Kerr wrote:
- This is actually from the command line, PgPool isn't involved at all. I just
- mentioned it to give some context.

I had a brief conversation with Quentin offline, which indicated that I wasn't
being nearly clear or direct enough.

I believe that this problem is specific to AWS+Postgres (and possibly to
VPC+Amazon Linux). Non-Postgres commands run fine, and even psql works fine;
so far only pg_ctl fails.

I'm running the command directly from an interactive shell, so this is as basic as it gets.

The command runs correctly on the remote server; it just never exits the ssh connection.
Specifically, I never get "debug2: channel 0: rcvd close", as if that packet gets
dropped every time.

I don't believe it's a network hiccup because I can reproduce it every time.

It's likely something in Amazon's infrastructure that's eating it, but whatever
it is, it seems to specifically not like pg_ctl.

Here is what I see happen:
[pgpool@ccpgp05 ~]$ ssh -vvv postgres@10.0.1.30 '/usr/pgsql-9.2/bin/pg_ctl -D /db/pg -m fast restart'
OpenSSH_6.1p1, OpenSSL 1.0.1e-fips 11 Feb 2013
[..snip..]
debug1: Sending command: /usr/pgsql-9.2/bin/pg_ctl -D /db/pg -m fast restart
[..snip..]
waiting for server to shut down.... done
server stopped
server starting
[..snip..]
debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
debug1: client_input_channel_req: channel 0 rtype eow@openssh.com reply 0
debug2: channel 0: rcvd eow
debug2: channel 0: close_read
debug2: channel 0: input open -> closed
^Cdebug1: channel 0: free: client-session, nchannels 1                  # <---- This is where I ^C it
debug3: channel 0: status: The following connections are open:
  #0 client-session (t4 r0 i3/0 o0/0 fd -1/5 cc -1)


Notice that "input open -> closed" is where it hangs.

Now look at:
[pgpool@ccpgp05 ~]$ ssh -vvv postgres@10.0.1.30 'ls -ltr'
OpenSSH_6.1p1, OpenSSL 1.0.1e-fips 11 Feb 2013
[..snip..]
debug1: Sending command: ls -ltr
debug2: channel 0: output open -> drain
debug1: channel 0: forcing write
total 8
drwx------ 4 postgres postgres 4096 Apr  4 22:50 9.2
drwx------ 2 postgres postgres 4096 Apr  5 22:58 bin
[..snip..]
debug2: channel 0: input open -> closed
debug2: channel 0: rcvd close
debug3: channel 0: will not send data after close
debug2: channel 0: almost dead
debug2: channel 0: gc: notify user
debug2: channel 0: gc: user detached
debug2: channel 0: send close
debug2: channel 0: is dead
debug2: channel 0: garbage collecting
debug1: channel 0: free: client-session, nchannels 1
debug3: channel 0: status: The following connections are open:
  #0 client-session (t4 r0 i3/0 o3/0 fd -1/-1 cc -1)

Transferred: sent 2456, received 2448 bytes, in 0.0 seconds
Bytes per second: sent 105666.4, received 105322.3
debug1: Exit status 0
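
One plausible explanation for the missing "rcvd close" (an assumption, not something established in this thread): sshd closes a session channel only after it sees end-of-file on the command's output, and a reader gets EOF on a pipe only once every process holding the write end has exited or closed it. If the postmaster that pg_ctl starts in the background inherits the session's stdout, the channel would stay open after pg_ctl itself exits. The local sketch below reproduces that effect without ssh, using shell command substitution as the reader and `sleep 2` as a stand-in for the postmaster; `demo_hang` and `demo_redirected` are hypothetical names.

```shell
#!/bin/sh
# Local analogue of the suspected hang (an assumption, not a confirmed cause):
# $(...) reads until EOF on its pipe, much as sshd is assumed to wait for EOF
# on a session's output before closing the channel.
# `sleep 2` stands in for a daemon started in the background.

demo_hang() {
    # The background child inherits stdout, so the pipe stays open for about
    # 2 s after the parent shell has printed its message and exited.
    start=$(date +%s)
    out=$(sh -c 'echo "server starting"; sleep 2 &')
    end=$(date +%s)
    echo "$out: waited $((end - start)) s"
}

demo_redirected() {
    # Same thing, but the child's stdin/stdout/stderr are detached,
    # so the reader sees EOF as soon as the parent shell exits.
    start=$(date +%s)
    out=$(sh -c 'echo "server starting"; sleep 2 </dev/null >/dev/null 2>&1 &')
    end=$(date +%s)
    echo "$out: waited $((end - start)) s"
}

demo_hang
demo_redirected
```

The first call should report roughly 2 s and the second roughly 0 s; the exact figure depends on where the one-second clock ticks fall.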


Thanks.

Re: AWS and postgres issues

From

David Kerr

Date:

08 April 2013, 23:24:50

On Mon, Apr 08, 2013 at 02:59:56PM -0700, David Kerr wrote:
- [..snip..]


I've verified that it's not related to the Linux flavor: I've tried it with a server on both Amazon Linux
and Ubuntu, and with a client on both Amazon Linux and my own desktop.

I can't be the only person using PG in AWS+VPC. Can someone else with a similar test bed give it a shot
and tell me whether it works for them? (At least then I'd know whether it's likely something I'm doing...)

Thanks

Re: AWS and postgres issues

From

David Kerr

Date:

08 April 2013, 23:54:03

On Mon, Apr 08, 2013 at 04:24:45PM -0700, David Kerr wrote:
- I've verified that it's not related to the linux flavor. I've tried it with a Server on both Amazon Linux
- and Ubuntu. And with a client on Amazon Linux and my own desktop.
-
- I can't be the only person using PG in AWS+VPC, can someone else with a similar test bed give it a shot
- and tell me if it works for them? (at least then I'd know if it's likely something I'm doing...)
-
- Thanks

While debugging this with a coworker, we figured out that pg_ctl was attaching to the tty, and then it clicked
that we needed to be using '-t' where I was using -T (or neither).

So mystery solved!
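
For reuse in a script (for example, a PgPool failover or online-recovery command), the fix can be wrapped in a small function. This is a minimal sketch: the host and paths are the ones from this thread, while `restart_remote_pg` and the `SSH` override (which exists only so the command can be dry-run with `SSH=echo`) are hypothetical. It passes -tt rather than -t because, per the ssh manual, a single -t does not allocate a pseudo-tty when ssh's own stdin is not a terminal, which is typically the case for non-interactive scripts.

```shell
#!/bin/sh
# Hypothetical helper: restart PostgreSQL on a remote host over ssh.
# Host and paths are taken from this thread; adjust for your setup.
PG_HOST="postgres@10.0.1.30"
PG_CTL="/usr/pgsql-9.2/bin/pg_ctl"
PG_DATA="/db/pg"

restart_remote_pg() {
    # -tt forces pseudo-terminal allocation even when this script has no
    # local tty (a single -t is ignored in that case).
    # SSH can be overridden, e.g. SSH=echo, to print the command instead.
    ${SSH:-ssh} -tt "$PG_HOST" "$PG_CTL -D $PG_DATA -m fast restart"
}
```

`SSH=echo restart_remote_pg` prints the ssh arguments; calling `restart_remote_pg` with no override performs the restart.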

Re: AWS and postgres issues

From

Tatsuo Ishii

Date:

09 April 2013, 00:35:04

> While debugging this with a coworker we figured out that pg_ctl was attaching to the tty and then it clicked
> that we needed to be using '-t' where I was using -T or (neither).

Are you sure? I checked the pg_ctl source code and could not find any
place attaching to the tty.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Re: AWS and postgres issues

From

Ian Lawrence Barwick

Date:

09 April 2013, 00:45:39

2013/4/9 Tatsuo Ishii <ishii@postgresql.org>:
>> While debugging this with a coworker we figured out that pg_ctl was attaching to the tty and then it clicked
>> that we needed to be using '-t' where I was using -T or (neither).
>
> Are you sure? I checked the pg_ctl source code and could not find any
> place attaching to the tty.

I think he means the ssh options -t and -T


Regards

Ian Barwick

Re: AWS and postgres issues

From

Tatsuo Ishii

Date:

09 April 2013, 00:52:27

> 2013/4/9 Tatsuo Ishii <ishii@postgresql.org>:
>>> While debugging this with a coworker we figured out that pg_ctl was attaching to the tty and then it clicked
>>> that we needed to be using '-t' where I was using -T or (neither).
>>
>> Are you sure? I checked the pg_ctl source code and could not find any
>> place attaching to the tty.
>
> I think he means the ssh options -t and -T

Yes, I know. As I understand it, he is saying that because pg_ctl attaches
to the tty, ssh should be executed with -t (which forces ssh to allocate a
pseudo-tty).
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Re: AWS and postgres issues

From

David Kerr

Date:

09 April 2013, 02:20:24

On Apr 8, 2013, at 5:52 PM, Tatsuo Ishii <ishii@postgresql.org> wrote:

> Yes, I know. In my understanding, he is saying because pg_ctl attaches
> to the tty, and ssh should be executed with -t (force ssh to allocate
> pseudo-tty).

Yeah, and I really expected NOT to want to attach to the pseudo-tty.

I was also sort of hoping that it was dropping packets or something like that, because
then it might have been a similar problem to the one I reported here:

http://www.pgpool.net/pipermail/pgpool-general/2013-February/001418.html
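
If you would rather not allocate a pseudo-tty at all, a different approach is to detach pg_ctl's standard streams from the ssh channel on the remote side, so that the backgrounded postmaster (which, as an assumption here, inherits them from pg_ctl) cannot keep the channel open. This is an untested sketch, not something tried in this thread. The log path `/tmp/pg_ctl_restart.log` and the function name `restart_remote_pg_notty` are hypothetical, and `SSH` can be set to `echo` for a dry run. pg_ctl's own -l option, which appends the server's log output to a file, is also worth considering.

```shell
#!/bin/sh
# Hypothetical helper: restart PostgreSQL over ssh without a pseudo-tty.
# Host and paths come from this thread; the log file path is made up.
PG_HOST="postgres@10.0.1.30"
PG_CTL="/usr/pgsql-9.2/bin/pg_ctl"
PG_DATA="/db/pg"
PG_LOG="/tmp/pg_ctl_restart.log"

restart_remote_pg_notty() {
    # The redirections are part of the remote command string, so the remote
    # shell applies them: pg_ctl (and, assumed, anything it starts in the
    # background) reads /dev/null and writes to the log file instead of the
    # ssh channel. Set SSH=echo to print the command instead of running it.
    ${SSH:-ssh} -T "$PG_HOST" "$PG_CTL -D $PG_DATA -m fast restart </dev/null >>$PG_LOG 2>&1"
}
```

The trade-off is that pg_ctl's progress messages end up in the log file rather than on your terminal, although ssh still returns pg_ctl's exit status.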