Thread: Failing GSSAPI TCP when connecting to server
My application is trying to connect the database server, and meanwhile tries to talk to the KDC server for a service ticket. Earlier these TCP connections did run like this, and were successful: 13:57:53.788797 IP6 clientIPv6.54143 > serverIPv6.88: Flags [S], seq 4189109662, win 65535, options [mss 1440,nop,wscale6,sackOK,TS val 402503231 ecr 0], length 0 13:57:53.789053 IP6 serverIPv6.88 > clientIPv6.54143: Flags [S.], seq 1604590824, ack 4189109663, win 65535, options [mss1440,nop,wscale 6,sackOK,TS val 1505702993 ecr 402503231], length 0 13:57:53.789122 IP6 clientIPv6.54143 > serverIPv6.88: Flags [.], ack 1, win 1035, options [nop,nop,TS val 402503231 ecr 1505702993],length 0 13:57:53.789223 IP6 clientIPv6.54143 > serverIPv6.88: Flags [P.], seq 1:5, ack 1, win 1035, options [nop,nop,TS val 402503231ecr 1505702993], length 4 13:57:53.829575 IP6 serverIPv6.88 > clientIPv6.54143: Flags [.], ack 5, win 1035, options [nop,nop,TS val 1505703037 ecr402503231], length 0 13:57:53.829696 IP6 clientIPv6.54143 > serverIPv6.88: Flags [P.], seq 5:1166, ack 1, win 1035, options [nop,nop,TS val 402503271ecr 1505703037], length 1161 13:57:53.832411 IP6 serverIPv6.88 > clientIPv6.54143: Flags [P.], seq 1:5, ack 1166, win 1035, options [nop,nop,TS val 1505703037ecr 402503271], length 4 13:57:53.832440 IP6 serverIPv6.88 > clientIPv6.54143: Flags [FP.], seq 5:907, ack 1166, win 1035, options [nop,nop,TS val1505703037 ecr 402503271], length 902 13:57:53.832461 IP6 clientIPv6.54143 > serverIPv6.88: Flags [.], ack 908, win 1021, options [nop,nop,TS val 402503276 ecr1505703037], length 0 13:57:53.832620 IP6 clientIPv6.54143 > serverIPv6.88: Flags [F.], seq 1166, ack 908, win 1035, options [nop,nop,TS val 402503276ecr 1505703037], length 0 13:57:53.832847 IP6 serverIPv6.88 > clientIPv6.54143: Flags [.], ack 1167, win 1035, options [nop,nop,TS val 1505703037 ecr402503276], length 0 After an upgrade of the application they now look like this: 16:53:09.637708 IP6 clientIPv6.49124 > serverIPv6.88: Flags [S], seq 2860816940, win 65535, options [mss 1440,nop,wscale6,sackOK,TS val 2664510889 ecr 0], length 0 16:53:09.637990 IP6 serverIPv6.88 > clientIPv6.49124: Flags [S.], seq 2595272492, ack 2860816941, win 65535, options [mss1440,nop,wscale 6,sackOK,TS val 3896716896 ecr 2664510889], length 0 16:53:09.638153 IP6 clientIPv6.49124 > serverIPv6.88: Flags [.], ack 1, win 1035, options [nop,nop,TS val 2664510889 ecr3896716896], length 0 16:53:09.638158 IP6 clientIPv6.49124 > serverIPv6.88: Flags [P.], seq 1:5, ack 1, win 1035, options [nop,nop,TS val 2664510889ecr 3896716896], length 4 16:53:09.638162 IP6 clientIPv6.49124 > serverIPv6.88: Flags [FP.], seq 5:1166, ack 1, win 1035, options [nop,nop,TS val 2664510889ecr 3896716896], length 1161 16:53:09.638451 IP6 serverIPv6.88 > clientIPv6.49124: Flags [.], ack 1167, win 1017, options [nop,nop,TS val 3896716896 ecr2664510889], length 0 16:53:09.638508 IP clientIPv4.58934 > serverIPv4.88: Flags [S], seq 3474302110, win 65535, options [mss 1460,nop,wscale 6,sackOK,TSval 726603825 ecr 0], length 0 16:53:09.638696 IP serverIPv4.88 > clientIPv4.58934: Flags [S.], seq 2424341350, ack 3474302111, win 65535, options [mss1460,nop,wscale 6,sackOK,TS val 2662337458 ecr 726603825], length 0 16:53:09.638729 IP clientIPv4.58934 > serverIPv4.88: Flags [.], ack 1, win 1027, options [nop,nop,TS val 726603825 ecr 2662337458],length 0 16:53:09.638733 IP clientIPv4.58934 > serverIPv4.88: Flags [P.], seq 1:5, ack 1, win 1027, options [nop,nop,TS val 726603825ecr 2662337458], length 4 16:53:09.638751 IP clientIPv4.58934 > serverIPv4.88: Flags [FP.], seq 5:1166, ack 1, win 1027, options [nop,nop,TS val 726603825ecr 2662337458], length 1161 16:53:09.638856 IP6 clientIPv6.46516 > serverIPv6.88: Flags [S], seq 3686126664, win 65535, options [mss 1440,nop,wscale6,sackOK,TS val 1901751531 ecr 0], length 0 16:53:09.639054 IP serverIPv4.88 > clientIPv4.58934: Flags [.], ack 1167, win 1009, options [nop,nop,TS val 2662337458 ecr726603825], length 0 16:53:09.639071 IP6 serverIPv6.88 > clientIPv6.46516: Flags [S.], seq 3830196306, ack 3686126665, win 65535, options [mss1440,nop,wscale 6,sackOK,TS val 457425497 ecr 1901751531], length 0 16:53:09.639105 IP6 clientIPv6.46516 > serverIPv6.88: Flags [.], ack 1, win 1035, options [nop,nop,TS val 1901751531 ecr457425497], length 0 16:53:09.639130 IP6 clientIPv6.46516 > serverIPv6.88: Flags [P.], seq 1:5, ack 1, win 1035, options [nop,nop,TS val 1901751531ecr 457425497], length 4 16:53:09.639154 IP6 clientIPv6.46516 > serverIPv6.88: Flags [FP.], seq 5:1166, ack 1, win 1035, options [nop,nop,TS val 1901751531ecr 457425497], length 1161 This is repeated LOTS of times, and what remains is hundreds of sockets in FIN_WAIT_2 that do not appear to disappear quickly. The error message from postgres client is GSSAPI continuation error: Miscellaneous failure: unable to reach any KDC in realm (In fact in does reach the KDC, only the TCP doesn't seem to work) A configuration problem on the machine(s) can be ruled out, because both old (working) and new (failing) application are installed on the same machine at the same time, using the same network, same hardware, same OS, same libgssapi and same postgres client software connecting to the same database. There are no errors logged on the KDC server, it appears to have orderly processed the requests (at least at first, it starts to complain later when the stale sockets get too many). The error message on the postgres server is FATAL: GSSAPI authentication failed for user "pmc" The OS is FreeBSD Release 13.4-p1 The postgres client library libpq.so.5 is Release 15.7 The application postgres interface is rubyGem pq Release 1.5.4 The application is Discourse 2.2.0 (working) and 2.3.1 (failing) What is going on there?
On Sun, Sep 29, 2024 at 2:00 PM Peter <pmc@citylink.dinoex.sub.org> wrote:
My application is trying to connect the database server, and meanwhile
tries to talk to the KDC server for a service ticket.
Earlier these TCP connections did run like this, and were successful:
[snip]
A configuration problem on the machine(s) can be ruled out,
Famous last words.
because both
old (working) and new (failing) application are installed on the same
machine at the same time, using the same network, same hardware, same
OS, same libgssapi and same postgres client software connecting to the
same database.
There are no errors logged on the KDC server, it appears to have
orderly processed the requests (at least at first, it starts to
complain later when the stale sockets get too many).
The error message on the postgres server is
FATAL: GSSAPI authentication failed for user "pmc"
Is there a way to test pmc authentication via some other tool, like psql?
The OS is FreeBSD Release 13.4-p1
The postgres client library libpq.so.5 is Release 15.7
The application postgres interface is rubyGem pq Release 1.5.4
The application is Discourse 2.2.0 (working) and 2.3.1 (failing)
What is going on there?
If only the application changed, then by definition it can't be a database problem. Something in the application changed; you just haven't found it.
Specifically, I'd read the Discourse 2.3.0 and 2.3.1 release notes.
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> crustacean!
Ron Johnson <ronljohnsonjr@gmail.com> writes: > On Sun, Sep 29, 2024 at 2:00 PM Peter <pmc@citylink.dinoex.sub.org> wrote: >> My application is trying to connect the database server, and meanwhile >> tries to talk to the KDC server for a service ticket. >> A configuration problem on the machine(s) can be ruled out, > Famous last words. The TCP trace looks like the client side is timing out too quickly in the unsuccessful case. It's not clear to me how the different Discourse version would lead to the Kerberos library applying a different timeout. Still, it seems like most of the moving parts here are outside of Postgres' control --- I don't think that libpq itself has much involvement in the KDC communication. I concur with looking at the Discourse release notes and maybe asking some questions in that community. regards, tom lane
Hello Folks, Thanks for Your inspiration; and I made some progress (found a way to avoid the issue). The issue is most likely not related to postgres. Ron Johnson said: >> A configuration problem on the machine(s) can be ruled out, > Famous last words. Trust me. :) > Is there a way to test pmc authentication via some other tool, like psql? Sure, that works. The problem is contained inside the running application program(s), everything else doesn't show it. > If *only *the application changed, then by definition it can't be a > database problem. *Something* in the application changed; you just haven't > found it. Obviousely, yes. But then, such a change might expose an undesired behaviour elsewhere. > Specifically, I'd read the Discourse 2.3.0 and 2.3.1 release notes. Correction: it is actually 3.2.0 and 3.3.1. I finally went the way of bisecting, and, it's not really a problem in Discourse either. It comes from a feature I had enabled in the course of migrating, a filesystem change monitor based on kqueue: https://man.freebsd.org/cgi/man.cgi?query=kqueue Removing that feature solves the issue for now. I have still no idea how that tool might lead to mishandled sockets elsewhere; it might somehow have to do with the async processing of the DB connect. That would need a thorough look into the code where this is done. Tom Lane wrote: >The TCP trace looks like the client side is timing out too quickly >in the unsuccessful case. It's not clear to me how the different >Discourse version would lead to the Kerberos library applying a >different timeout. It's not a timeout; a timeout would close the socket. It seems to rather forget the socket. >Still, it seems like most of the moving parts >here are outside of Postgres' control --- I don't think that libpq >itself has much involvement in the KDC communication. Kerberos is weird. It goes into libgssapi, but libgssapi doesn't do much on it's own, it just maps so-called "mech"s, which then point to the actual kerberos code - which in the case of FreeBSD is very ancient (but work should be underway to modernize it). It's one of the most creepy pieces of code I've looked into. > I concur with looking at the Discourse release notes and maybe asking > some questions in that community. They only support that app to run in a certain containerization on a specific brand of Linux. They don't like my questions and might just delete them. Anyway, I have a lead now to either avoid the problem or where to look more closely. And it has not directly to do with postgres, but rather with genuine socket mishandling and/or maybe some flaw in FreeBSD. cheers, PMc