Thread: TCP Resets when closing connection opened via SSL

TCP Resets when closing connection opened via SSL

From
Jānis Pūris
Date:

Hello,


I'm trying to do a simple health check for keepalived and other services via a python script and psycopg2 library. All seems to be all right, until I close the connection, at which point a packet with TCP reset is produced.

This has become very problematic and creates extensive noise in monitoring.

It also happens with IntelliJ DataGrip provided Java driver and PHP drivers. And hence I do not think this is the drivers problem, but PG instead.


Script

import psycopg2
import sys


def check():   # Relying on .pgpass for password   con = psycopg2.connect('user=monitoring dbname=postgres host=127.0.0.1')   cur = con.cursor()   cur.execute("select 'keepalived healthcheck'")   cur.close()   con.close()

try:   check()   print("ok")
except Exception as e:   print("not ok")   print(str(e))   sys.exit(1)

tcpdump tcpdump -v 'tcp[tcpflags] & (tcp-rst) != 0' -ilo

16:27:45.307006 IP (tos 0x0, ttl 64, id 8123, offset 0, flags [DF], proto TCP (6), length 40)   localhost.40797 > localhost.postgres: Flags [R], cksum 0x0cca (correct), seq 3830516781, win 0, length 0

From PG side of things, all seems to be OK

2019-04-23 16:27:45.300 CEST process=15615 c= t=0 s=5cbf20e1.3cff [unknown]@127.0.0.1:[unknown] app=[unknown] LOG:  connection received: host=127.0.0.1 port=40797
2019-04-23 16:27:45.304 CEST process=15615 c=authentication t=0 s=5cbf20e1.3cff monitoring@127.0.0.1:postgres app=[unknown] LOG:  connection authorized: user=monitoring database=postgres SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES256-GCM-SHA384, compression=off)
2019-04-23 16:27:45.306 CEST process=15615 c=BEGIN t=0 s=5cbf20e1.3cff monitoring@127.0.0.1:postgres app=[unknown] LOG:  duration: 0.095 ms
2019-04-23 16:27:45.306 CEST process=15615 c=SELECT t=0 s=5cbf20e1.3cff monitoring@127.0.0.1:postgres app=[unknown] LOG:  duration: 0.234 ms
2019-04-23 16:27:45.306 CEST process=15615 c=idle in transaction t=0 s=5cbf20e1.3cff monitoring@127.0.0.1:postgres app=[unknown] LOG:  disconnection: session time: 0:00:00.006 user=monitoring database=postgres host=127.0.0.1 port=40797

I've tried various combinations of OS, psycopg2 and PG versions to no avail. My focus has been to reproduce with this driver only.

  • OS: Centos 7 and Ubuntu 18.04
  • psycopg2: From latest down to 2.7.x 
  • PG: Latest 11 down to 9.4 (we use 9.4-bdr currently in production)
I can not reproduce this, when SSL has been disabled on PG. However this is not a solution, as we are committed to use SSL for all our connections to DB.

Has anyone experienced this and could share any advice how I could prevent this from producing network packets with TCP [R] flag ?


Thank you very much in advance and any advice will be much appreciated.


P.S. I've also posted this on following channels:


Med vennlig hilsen.
Best regards, Janis Puris.

Re: TCP Resets when closing connection opened via SSL

From
Jānis Pūris
Date:
Resending the original message without any formatting.

Hello,

I'm trying to do a simple health check for keepalived and other services via a python script and psycopg2 library. All seems to be all right, until I close the connection, at which point a packet with TCP reset is produced.

This has become very problematic and creates extensive noise in monitoring.

It also happens with IntelliJ DataGrip provided Java driver and PHP drivers. And hence I do not think this is the drivers problem, but PG instead.

Script

import psycopg2
import sys

def check():
   # Relying on .pgpass for password
   con = psycopg2.connect('user=monitoring dbname=postgres host=127.0.0.1')
   cur = con.cursor()
   cur.execute("select 'keepalived healthcheck'")
   cur.close()
   con.close()

try:
   check()
   print("ok")
except Exception as e:
   print("not ok")
   print(str(e))
   sys.exit(1)

tcpdump tcpdump -v 'tcp[tcpflags] & (tcp-rst) != 0' -ilo

16:27:45.307006 IP (tos 0x0, ttl 64, id 8123, offset 0, flags [DF], proto TCP (6), length 40)
   localhost.40797 > localhost.postgres: Flags [R], cksum 0x0cca (correct), seq 3830516781, win 0, length 0

From PG side of things, all seems to be OK

2019-04-23 16:27:45.300 CEST process=15615 c= t=0 s=5cbf20e1.3cff [unknown]@127.0.0.1:[unknown] app=[unknown] LOG:  connection received: host=127.0.0.1 port=40797
2019-04-23 16:27:45.304 CEST process=15615 c=authentication t=0 s=5cbf20e1.3cff monitoring@127.0.0.1:postgres app=[unknown] LOG:  connection authorized: user=monitoring database=postgres SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES256-GCM-SHA384, compression=off)
2019-04-23 16:27:45.306 CEST process=15615 c=BEGIN t=0 s=5cbf20e1.3cff monitoring@127.0.0.1:postgres app=[unknown] LOG:  duration: 0.095 ms
2019-04-23 16:27:45.306 CEST process=15615 c=SELECT t=0 s=5cbf20e1.3cff monitoring@127.0.0.1:postgres app=[unknown] LOG:  duration: 0.234 ms
2019-04-23 16:27:45.306 CEST process=15615 c=idle in transaction t=0 s=5cbf20e1.3cff monitoring@127.0.0.1:postgres app=[unknown] LOG:  disconnection: session time: 0:00:00.006 user=monitoring database=postgres host=127.0.0.1 port=40797

I've tried various combinations of OS, psycopg2 and PG versions to no avail. My focus has been to reproduce with this driver only.

• OS: Centos 7 and Ubuntu 18.04
• psycopg2: From latest down to 2.7.x
• PG: Latest 11 down to 9.4 (we use 9.4-bdr currently in production)

I can not reproduce this, when SSL has been disabled on PG. However this is not a solution, as we are committed to use SSL for all our connections to DB.

Has anyone experienced this and could share any advice how I could prevent this from producing network packets with TCP [R] flag ?

Thank you very much in advance and any advice will be much appreciated.

P.S. I've also posted this on following channels:

• https://dba.stackexchange.com/questions/235497/tcp-resets-when-using-psycopg2
• https://github.com/psycopg/psycopg2/issues/906

Med vennlig hilsen.
Best regards, Janis Puris.

Med vennlig hilsen.
Best regards, Janis Puris.
On 26 Apr 2019, 09:34 +0200, Jānis Pūris <janis@puris.lv>, wrote:

Hello,


I'm trying to do a simple health check for keepalived and other services via a python script and psycopg2 library. All seems to be all right, until I close the connection, at which point a packet with TCP reset is produced.

This has become very problematic and creates extensive noise in monitoring.

It also happens with IntelliJ DataGrip provided Java driver and PHP drivers. And hence I do not think this is the drivers problem, but PG instead.


Script

import psycopg2
import sys


def check():   # Relying on .pgpass for password   con = psycopg2.connect('user=monitoring dbname=postgres host=127.0.0.1')   cur = con.cursor()   cur.execute("select 'keepalived healthcheck'")   cur.close()   con.close()

try:   check()   print("ok")
except Exception as e:   print("not ok")   print(str(e))   sys.exit(1)

tcpdump tcpdump -v 'tcp[tcpflags] & (tcp-rst) != 0' -ilo

16:27:45.307006 IP (tos 0x0, ttl 64, id 8123, offset 0, flags [DF], proto TCP (6), length 40)   localhost.40797 > localhost.postgres: Flags [R], cksum 0x0cca (correct), seq 3830516781, win 0, length 0

From PG side of things, all seems to be OK

2019-04-23 16:27:45.300 CEST process=15615 c= t=0 s=5cbf20e1.3cff [unknown]@127.0.0.1:[unknown] app=[unknown] LOG:  connection received: host=127.0.0.1 port=40797
2019-04-23 16:27:45.304 CEST process=15615 c=authentication t=0 s=5cbf20e1.3cff monitoring@127.0.0.1:postgres app=[unknown] LOG:  connection authorized: user=monitoring database=postgres SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES256-GCM-SHA384, compression=off)
2019-04-23 16:27:45.306 CEST process=15615 c=BEGIN t=0 s=5cbf20e1.3cff monitoring@127.0.0.1:postgres app=[unknown] LOG:  duration: 0.095 ms
2019-04-23 16:27:45.306 CEST process=15615 c=SELECT t=0 s=5cbf20e1.3cff monitoring@127.0.0.1:postgres app=[unknown] LOG:  duration: 0.234 ms
2019-04-23 16:27:45.306 CEST process=15615 c=idle in transaction t=0 s=5cbf20e1.3cff monitoring@127.0.0.1:postgres app=[unknown] LOG:  disconnection: session time: 0:00:00.006 user=monitoring database=postgres host=127.0.0.1 port=40797

I've tried various combinations of OS, psycopg2 and PG versions to no avail. My focus has been to reproduce with this driver only.

  • OS: Centos 7 and Ubuntu 18.04
  • psycopg2: From latest down to 2.7.x 
  • PG: Latest 11 down to 9.4 (we use 9.4-bdr currently in production)
I can not reproduce this, when SSL has been disabled on PG. However this is not a solution, as we are committed to use SSL for all our connections to DB.

Has anyone experienced this and could share any advice how I could prevent this from producing network packets with TCP [R] flag ?


Thank you very much in advance and any advice will be much appreciated.


P.S. I've also posted this on following channels:


Med vennlig hilsen.
Best regards, Janis Puris.

Re: TCP Resets when closing connection opened via SSL

From
Rainer Pruy
Date:
On 26.04.19 09:41, Jānis Pūris wrote:
Resending the original message without any formatting.

Hello,

I'm trying to do a simple health check for keepalived and other services via a python script and psycopg2 library. All seems to be all right, until I close the connection, at which point a packet with TCP reset is produced.

This has become very problematic and creates extensive noise in monitoring.

It also happens with IntelliJ DataGrip provided Java driver and PHP drivers. And hence I do not think this is the drivers problem, but PG instead.


From PG side of things, all seems to be OK

What actual problem you are experiencing?

Usually, getting a TCP reset is a normal event on TCP connections not necessarily related to any application level activity.

A TCP reset is triggered as soon as a packet is being received on a "closed" connection.

Such "late orphaned packet" may e.g. result from an ack for a packet in transit by the time of close.


For more detailed information you might look at the exact sequence of packets being exchanged.

This way, you might get better insight in what is happening and better information on wether something is going wrong.


That you do experience TCP reset only with SSL being involved might just be accidentally or be a

consequence of  SSL stack activities. It could even be caused by the way the (SSL) connection

is being configured/set-up in the specific case.

But nevertheless, it might be perfectly OK for such packets to be generated.

So, please re-think why you are considering them problematic in the first place.


Regards

Rainer

Re: TCP Resets when closing connection opened via SSL

From
Tom Lane
Date:
=?utf-8?Q?J=C4=81nis_P=C5=ABris?= <janis@puris.lv> writes:
> I'm trying to do a simple health check for keepalived and other services via a python script and psycopg2 library.
Allseems to be all right, until I close the connection, at which point a packet with TCP reset is produced. 

Hm.  It's fairly obvious from the postmaster log that the client side
is not bothering to close the transaction it started:

> 2019-04-23 16:27:45.306 CEST process=15615 c=BEGIN t=0 s=5cbf20e1.3cff monitoring@127.0.0.1:postgres app=[unknown]
LOG: duration: 0.095 ms 
> 2019-04-23 16:27:45.306 CEST process=15615 c=SELECT t=0 s=5cbf20e1.3cff monitoring@127.0.0.1:postgres app=[unknown]
LOG: duration: 0.234 ms 
> 2019-04-23 16:27:45.306 CEST process=15615 c=idle in transaction t=0 s=5cbf20e1.3cff monitoring@127.0.0.1:postgres
app=[unknown]LOG:  disconnection: session time: 0:00:00.006 user=monitoring database=postgres host=127.0.0.1 port=40797 

There's no COMMIT or ROLLBACK to go with the BEGIN, and that's reflected
in the fact that the disconnection message shows c=idle in transaction.
Now, I doubt that would have any impact on the TCP-level session behavior,
but it suggests that maybe the client isn't bothering to close the session
cleanly either.  The RST rather than FIN would then likely be caused by
SSL having to do an unclean shutdown.

            regards, tom lane



Re: TCP Resets when closing connection opened via SSL

From
Jānis Pūris
Date:
Thanks for the insight, Tom.

It's fairly obvious from the postmaster log that the client side
is not bothering to close the transaction it started

Thats what I was also thinking, but I've managed to reproduce it with autocommit or commit before closing connection as well.
This is only reproducible with SSL connection.

I've reproduced it with ubuntu package bundles certs, my own created, generated with CA etc hence I do not think it is a problem with certificates themselves.
This happens with various clients, php driver, java driver and also libpq used by psycopg2. We are running a lot of different kind of microservices and thus far I can tell, almost (if not all) of them have this problem manifest.

I can not however reproduce this when doing something like
$ psql "user=postgres sslmode=require host=localhost dbname=postgres"
postgres=# \q

So, please re-think why you are considering them problematic in the first place.

This is an issue because we have a lot of connections being initiated and closed. And this creates a lot of TCP resets.

We could go with something like pgbouncer in front of it to pool the connections in order to reduce the reset count, but this would not be solving the root cause, but merely working around it.

Other than that, this does not seem to "break" anything. This is basically large noise for monitoring and huge numbers of resets, when looking at something like netstat. And by no means, this happens casually.

Has no one expieriencing this ?
Should I just assume this is "normal" and forget about it ? I'd still want to understand what's going on thought.

Med vennlig hilsen.
Best regards, Janis Puris.
On 26 Apr 2019, 16:17 +0200, Tom Lane <tgl@sss.pgh.pa.us>, wrote:
Jānis Pūris <janis@puris.lv> writes:
I'm trying to do a simple health check for keepalived and other services via a python script and psycopg2 library. All seems to be all right, until I close the connection, at which point a packet with TCP reset is produced.

Hm. It's fairly obvious from the postmaster log that the client side
is not bothering to close the transaction it started:

2019-04-23 16:27:45.306 CEST process=15615 c=BEGIN t=0 s=5cbf20e1.3cff monitoring@127.0.0.1:postgres app=[unknown] LOG: duration: 0.095 ms
2019-04-23 16:27:45.306 CEST process=15615 c=SELECT t=0 s=5cbf20e1.3cff monitoring@127.0.0.1:postgres app=[unknown] LOG: duration: 0.234 ms
2019-04-23 16:27:45.306 CEST process=15615 c=idle in transaction t=0 s=5cbf20e1.3cff monitoring@127.0.0.1:postgres app=[unknown] LOG: disconnection: session time: 0:00:00.006 user=monitoring database=postgres host=127.0.0.1 port=40797

There's no COMMIT or ROLLBACK to go with the BEGIN, and that's reflected
in the fact that the disconnection message shows c=idle in transaction.
Now, I doubt that would have any impact on the TCP-level session behavior,
but it suggests that maybe the client isn't bothering to close the session
cleanly either. The RST rather than FIN would then likely be caused by
SSL having to do an unclean shutdown.

regards, tom lane

Re: TCP Resets when closing connection opened via SSL

From
Tom Lane
Date:
=?utf-8?Q?J=C4=81nis_P=C5=ABris?= <janis@puris.lv> writes:
> This is only reproducible with SSL connection.

Seems like you might get more insight from OpenSSL experts, then.
(No idea where you find some, but they must exist.)

            regards, tom lane



Re: TCP Resets when closing connection opened via SSL

From
Jeremy Harris
Date:
On 26/04/2019 18:19, Tom Lane wrote:
> =?utf-8?Q?J=C4=81nis_P=C5=ABris?= <janis@puris.lv> writes:
>> This is only reproducible with SSL connection.
> 
> Seems like you might get more insight from OpenSSL experts, then.
> (No idea where you find some, but they must exist.)

List-Id: <openssl-users.openssl.org>

-- 
Cheers,
  Jeremy



Re: TCP Resets when closing connection opened via SSL

From
"Peter J. Holzer"
Date:
On 2019-04-26 18:36:12 +0200, Jānis Pūris wrote:
> Thanks for the insight, Tom.
>
>
>     It's fairly obvious from the postmaster log that the client side
>     is not bothering to close the transaction it started
>
>
> Thats what I was also thinking, but I've managed to reproduce it with
> autocommit or commit before closing connection as well.
> This is only reproducible with SSL connection.
>
> I've reproduced it with ubuntu package bundles certs, my own created, generated
> with CA etc hence I do not think it is a problem with certificates themselves.
> This happens with various clients, php driver, java driver and also libpq used
> by psycopg2. We are running a lot of different kind of microservices and thus
> far I can tell, almost (if not all) of them have this problem manifest.
>
> I can not however reproduce this when doing something like
> $ psql "user=postgres sslmode=require host=localhost dbname=postgres"
> postgres=# \q

I can reproduce it in this case also (kernel 4.9, openssl 1.1.0j,
posgres 9.5.16).

stracing psql shows this:

send(3, "\27\3\3\0\35fV\354\31v\332\10\22\202-#8\5\267\5\202\346\3748\353\326&\374\rf\2776"..., 34, MSG_NOSIGNAL) = 34
send(3, "\25\3\3\0\32fV\354\31v\332\10\23\307\207\335V\355Lhh\"\214\331\25l\352\27\32\332\357", 31, MSG_NOSIGNAL) = 31
close(3)                                = 0

I.e., psql sends some data (presumably a TLS close_notify alert) and
then closes the file descriptor.

tcpdump shows this:

12:30:02.276188 IP6 localhost.47712 > localhost.postgresql: Flags [P.], seq 414:445, ack 1239, win 360, options
[nop,nop,TSval 337416186 ecr 337416186], length 31 
12:30:02.276243 IP6 localhost.47712 > localhost.postgresql: Flags [F.], seq 445, ack 1239, win 360, options [nop,nop,TS
val337416186 ecr 337416186], length 0 
12:30:02.276306 IP6 localhost.postgresql > localhost.47712: Flags [P.], seq 1239:1270, ack 446, win 350, options
[nop,nop,TSval 337416186 ecr 337416186], length 31 
12:30:02.276315 IP6 localhost.47712 > localhost.postgresql: Flags [R], seq 2477517213, win 0, length 0

The first packet shown are the 31 bytes of the second send, then the
cliens sends a FIN packet. Then the server replies with a packet of the
same length (so presumably also a close_notify alert). Since the client
has already closed the connection, it replies with a RST.

One might argue that openssl should wait for the other side to reply
before closing the connection. But waiting doesn't really accomplish
anything from the application's point of view and introduces and
unnecessary delay.

> This is an issue because we have a lot of connections being initiated and
> closed. And this creates a lot of TCP resets.

Why are those resets a problem? (If the answer is "our monitoring
software complains about them" then the question becomes "why does your
monitoring software think they are a problem?")

        hp

--
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>

Attachment

Re: TCP Resets when closing connection opened via SSL

From
Jānis Pūris
Date:
Thank you, Peter.

It makes more sense now, and I'm so glad someone else is able to reproduce this.

To clarify on why I consider this a problem - It is not just the alerts on monitoring system. The alerts themselves can be tuned / thresholds increased etc. The real issue (could argue this is non-issue), is the noise..

If something start to not work as expected besides this "inconvenience" and start to produce TCP resets as well, this could potentially diminish the troubleshooting tool visibility.

That said. Now that someone else is able to reproduce this and I more clarity as to what may be going on (can explain to team), I can consider this as "known behaviour".

Hopefully this will be addressed someday, but until then, we'll just have to live with it. :)

Med vennlig hilsen.
Best regards, Janis Puris.
On 27 Apr 2019, 12:55 +0200, Peter J. Holzer <hjp-pgsql@hjp.at>, wrote:
On 2019-04-26 18:36:12 +0200, Jānis Pūris wrote:
Thanks for the insight, Tom.


It's fairly obvious from the postmaster log that the client side
is not bothering to close the transaction it started


Thats what I was also thinking, but I've managed to reproduce it with
autocommit or commit before closing connection as well.
This is only reproducible with SSL connection.

I've reproduced it with ubuntu package bundles certs, my own created, generated
with CA etc hence I do not think it is a problem with certificates themselves.
This happens with various clients, php driver, java driver and also libpq used
by psycopg2. We are running a lot of different kind of microservices and thus
far I can tell, almost (if not all) of them have this problem manifest.

I can not however reproduce this when doing something like
$ psql "user=postgres sslmode=require host=localhost dbname=postgres"
postgres=# \q

I can reproduce it in this case also (kernel 4.9, openssl 1.1.0j,
posgres 9.5.16).

stracing psql shows this:

send(3, "\27\3\3\0\35fV\354\31v\332\10\22\202-#8\5\267\5\202\346\3748\353\326&\374\rf\2776"..., 34, MSG_NOSIGNAL) = 34
send(3, "\25\3\3\0\32fV\354\31v\332\10\23\307\207\335V\355Lhh\"\214\331\25l\352\27\32\332\357", 31, MSG_NOSIGNAL) = 31
close(3) = 0

I.e., psql sends some data (presumably a TLS close_notify alert) and
then closes the file descriptor.

tcpdump shows this:

12:30:02.276188 IP6 localhost.47712 > localhost.postgresql: Flags [P.], seq 414:445, ack 1239, win 360, options [nop,nop,TS val 337416186 ecr 337416186], length 31
12:30:02.276243 IP6 localhost.47712 > localhost.postgresql: Flags [F.], seq 445, ack 1239, win 360, options [nop,nop,TS val 337416186 ecr 337416186], length 0
12:30:02.276306 IP6 localhost.postgresql > localhost.47712: Flags [P.], seq 1239:1270, ack 446, win 350, options [nop,nop,TS val 337416186 ecr 337416186], length 31
12:30:02.276315 IP6 localhost.47712 > localhost.postgresql: Flags [R], seq 2477517213, win 0, length 0

The first packet shown are the 31 bytes of the second send, then the
cliens sends a FIN packet. Then the server replies with a packet of the
same length (so presumably also a close_notify alert). Since the client
has already closed the connection, it replies with a RST.

One might argue that openssl should wait for the other side to reply
before closing the connection. But waiting doesn't really accomplish
anything from the application's point of view and introduces and
unnecessary delay.

This is an issue because we have a lot of connections being initiated and
closed. And this creates a lot of TCP resets.

Why are those resets a problem? (If the answer is "our monitoring
software complains about them" then the question becomes "why does your
monitoring software think they are a problem?")

hp

--
_ | Peter J. Holzer | we build much bigger, better disasters now
|_|_) | | because we have much more sophisticated
| | | hjp@hjp.at | management tools.
__/ | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>