Thread: Failing GSSAPI TCP when connecting to server

Failing GSSAPI TCP when connecting to server

From
Peter
Date:
My application is trying to connect the database server, and meanwhile
tries to talk to the KDC server for a service ticket. 
Earlier these TCP connections did run like this, and were successful:

13:57:53.788797 IP6 clientIPv6.54143 > serverIPv6.88: Flags [S], seq 4189109662, win 65535, options [mss
1440,nop,wscale6,sackOK,TS val 402503231 ecr 0], length 0
 
13:57:53.789053 IP6 serverIPv6.88 > clientIPv6.54143: Flags [S.], seq 1604590824, ack 4189109663, win 65535, options
[mss1440,nop,wscale 6,sackOK,TS val 1505702993 ecr 402503231], length 0
 
13:57:53.789122 IP6 clientIPv6.54143 > serverIPv6.88: Flags [.], ack 1, win 1035, options [nop,nop,TS val 402503231 ecr
1505702993],length 0
 
13:57:53.789223 IP6 clientIPv6.54143 > serverIPv6.88: Flags [P.], seq 1:5, ack 1, win 1035, options [nop,nop,TS val
402503231ecr 1505702993], length 4
 
13:57:53.829575 IP6 serverIPv6.88 > clientIPv6.54143: Flags [.], ack 5, win 1035, options [nop,nop,TS val 1505703037
ecr402503231], length 0
 
13:57:53.829696 IP6 clientIPv6.54143 > serverIPv6.88: Flags [P.], seq 5:1166, ack 1, win 1035, options [nop,nop,TS val
402503271ecr 1505703037], length 1161
 
13:57:53.832411 IP6 serverIPv6.88 > clientIPv6.54143: Flags [P.], seq 1:5, ack 1166, win 1035, options [nop,nop,TS val
1505703037ecr 402503271], length 4
 
13:57:53.832440 IP6 serverIPv6.88 > clientIPv6.54143: Flags [FP.], seq 5:907, ack 1166, win 1035, options [nop,nop,TS
val1505703037 ecr 402503271], length 902
 
13:57:53.832461 IP6 clientIPv6.54143 > serverIPv6.88: Flags [.], ack 908, win 1021, options [nop,nop,TS val 402503276
ecr1505703037], length 0
 
13:57:53.832620 IP6 clientIPv6.54143 > serverIPv6.88: Flags [F.], seq 1166, ack 908, win 1035, options [nop,nop,TS val
402503276ecr 1505703037], length 0
 
13:57:53.832847 IP6 serverIPv6.88 > clientIPv6.54143: Flags [.], ack 1167, win 1035, options [nop,nop,TS val 1505703037
ecr402503276], length 0
 

After an upgrade of the application they now look like this:

16:53:09.637708 IP6 clientIPv6.49124 > serverIPv6.88: Flags [S], seq 2860816940, win 65535, options [mss
1440,nop,wscale6,sackOK,TS val 2664510889 ecr 0], length 0
 
16:53:09.637990 IP6 serverIPv6.88 > clientIPv6.49124: Flags [S.], seq 2595272492, ack 2860816941, win 65535, options
[mss1440,nop,wscale 6,sackOK,TS val 3896716896 ecr 2664510889], length 0
 
16:53:09.638153 IP6 clientIPv6.49124 > serverIPv6.88: Flags [.], ack 1, win 1035, options [nop,nop,TS val 2664510889
ecr3896716896], length 0
 
16:53:09.638158 IP6 clientIPv6.49124 > serverIPv6.88: Flags [P.], seq 1:5, ack 1, win 1035, options [nop,nop,TS val
2664510889ecr 3896716896], length 4
 
16:53:09.638162 IP6 clientIPv6.49124 > serverIPv6.88: Flags [FP.], seq 5:1166, ack 1, win 1035, options [nop,nop,TS val
2664510889ecr 3896716896], length 1161
 
16:53:09.638451 IP6 serverIPv6.88 > clientIPv6.49124: Flags [.], ack 1167, win 1017, options [nop,nop,TS val 3896716896
ecr2664510889], length 0
 
16:53:09.638508 IP clientIPv4.58934 > serverIPv4.88: Flags [S], seq 3474302110, win 65535, options [mss 1460,nop,wscale
6,sackOK,TSval 726603825 ecr 0], length 0
 
16:53:09.638696 IP serverIPv4.88 > clientIPv4.58934: Flags [S.], seq 2424341350, ack 3474302111, win 65535, options
[mss1460,nop,wscale 6,sackOK,TS val 2662337458 ecr 726603825], length 0
 
16:53:09.638729 IP clientIPv4.58934 > serverIPv4.88: Flags [.], ack 1, win 1027, options [nop,nop,TS val 726603825 ecr
2662337458],length 0
 
16:53:09.638733 IP clientIPv4.58934 > serverIPv4.88: Flags [P.], seq 1:5, ack 1, win 1027, options [nop,nop,TS val
726603825ecr 2662337458], length 4
 
16:53:09.638751 IP clientIPv4.58934 > serverIPv4.88: Flags [FP.], seq 5:1166, ack 1, win 1027, options [nop,nop,TS val
726603825ecr 2662337458], length 1161
 
16:53:09.638856 IP6 clientIPv6.46516 > serverIPv6.88: Flags [S], seq 3686126664, win 65535, options [mss
1440,nop,wscale6,sackOK,TS val 1901751531 ecr 0], length 0
 
16:53:09.639054 IP serverIPv4.88 > clientIPv4.58934: Flags [.], ack 1167, win 1009, options [nop,nop,TS val 2662337458
ecr726603825], length 0
 
16:53:09.639071 IP6 serverIPv6.88 > clientIPv6.46516: Flags [S.], seq 3830196306, ack 3686126665, win 65535, options
[mss1440,nop,wscale 6,sackOK,TS val 457425497 ecr 1901751531], length 0
 
16:53:09.639105 IP6 clientIPv6.46516 > serverIPv6.88: Flags [.], ack 1, win 1035, options [nop,nop,TS val 1901751531
ecr457425497], length 0
 
16:53:09.639130 IP6 clientIPv6.46516 > serverIPv6.88: Flags [P.], seq 1:5, ack 1, win 1035, options [nop,nop,TS val
1901751531ecr 457425497], length 4
 
16:53:09.639154 IP6 clientIPv6.46516 > serverIPv6.88: Flags [FP.], seq 5:1166, ack 1, win 1035, options [nop,nop,TS val
1901751531ecr 457425497], length 1161
 

This is repeated LOTS of times, and what remains is hundreds of
sockets in FIN_WAIT_2 that do not appear to disappear quickly.
The error message from postgres client is
GSSAPI continuation error: Miscellaneous failure: unable to reach any KDC in realm 
(In fact in does reach the KDC, only the TCP doesn't seem to work)

A configuration problem on the machine(s) can be ruled out, because both
old (working) and new (failing) application are installed on the same
machine at the same time, using the same network, same hardware, same
OS, same libgssapi and same postgres client software connecting to the
same database.

There are no errors logged on the KDC server, it appears to have
orderly processed the requests (at least at first, it starts to
complain later when the stale sockets get too many).

The error message on the postgres server is
FATAL:  GSSAPI authentication failed for user "pmc"

The OS is FreeBSD Release 13.4-p1
The postgres client library libpq.so.5 is Release 15.7
The application postgres interface is rubyGem pq Release 1.5.4
The application is Discourse 2.2.0 (working) and 2.3.1 (failing)

What is going on there?



Re: Failing GSSAPI TCP when connecting to server

From
Ron Johnson
Date:
On Sun, Sep 29, 2024 at 2:00 PM Peter <pmc@citylink.dinoex.sub.org> wrote:
My application is trying to connect the database server, and meanwhile
tries to talk to the KDC server for a service ticket.
Earlier these TCP connections did run like this, and were successful:

[snip] 

A configuration problem on the machine(s) can be ruled out,

Famous last words.
 
because both
old (working) and new (failing) application are installed on the same
machine at the same time, using the same network, same hardware, same
OS, same libgssapi and same postgres client software connecting to the
same database.

There are no errors logged on the KDC server, it appears to have
orderly processed the requests (at least at first, it starts to
complain later when the stale sockets get too many).

The error message on the postgres server is
FATAL:  GSSAPI authentication failed for user "pmc"

Is there a way to test pmc authentication via some other tool, like psql?
 
The OS is FreeBSD Release 13.4-p1
The postgres client library libpq.so.5 is Release 15.7
The application postgres interface is rubyGem pq Release 1.5.4
The application is Discourse 2.2.0 (working) and 2.3.1 (failing)

What is going on there?

If only the application changed, then by definition it can't be a database problem.  Something in the application changed; you just haven't found it.
 
Specifically, I'd read the Discourse 2.3.0 and 2.3.1 release notes.

--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> crustacean!

Re: Failing GSSAPI TCP when connecting to server

From
Tom Lane
Date:
Ron Johnson <ronljohnsonjr@gmail.com> writes:
> On Sun, Sep 29, 2024 at 2:00 PM Peter <pmc@citylink.dinoex.sub.org> wrote:
>> My application is trying to connect the database server, and meanwhile
>> tries to talk to the KDC server for a service ticket.

>> A configuration problem on the machine(s) can be ruled out,

> Famous last words.

The TCP trace looks like the client side is timing out too quickly
in the unsuccessful case.  It's not clear to me how the different
Discourse version would lead to the Kerberos library applying a
different timeout.  Still, it seems like most of the moving parts
here are outside of Postgres' control --- I don't think that libpq
itself has much involvement in the KDC communication.

I concur with looking at the Discourse release notes and maybe asking
some questions in that community.

            regards, tom lane



Re: Failing GSSAPI TCP when connecting to server

From
Peter
Date:
Hello Folks,

  Thanks for Your inspiration; and I made some progress (found
a way to avoid the issue).

The issue is most likely not related to postgres.

Ron Johnson said:

>> A configuration problem on the machine(s) can be ruled out,
> Famous last words.

Trust me. :)

> Is there a way to test pmc authentication via some other tool, like psql?

Sure, that works. The problem is contained inside the running
application program(s), everything else doesn't show it.

> If *only *the application changed, then by definition it can't be a
> database problem.  *Something* in the application changed; you just haven't
> found it.

Obviousely, yes. But then, such a change might expose an undesired
behaviour elsewhere.

> Specifically, I'd read the Discourse 2.3.0 and 2.3.1 release notes.

Correction: it is actually 3.2.0 and 3.3.1.

I finally went the way of bisecting, and, it's not really a problem in
Discourse either. It comes from a feature I had enabled in the course
of migrating, a filesystem change monitor based on kqueue:
   https://man.freebsd.org/cgi/man.cgi?query=kqueue
Removing that feature solves the issue for now.

I have still no idea how that tool might lead to mishandled sockets
elsewhere; it might somehow have to do with the async processing of
the DB connect. That would need a thorough look into the code where
this is done.

Tom Lane wrote:

>The TCP trace looks like the client side is timing out too quickly
>in the unsuccessful case. It's not clear to me how the different
>Discourse version would lead to the Kerberos library applying a
>different timeout.

It's not a timeout; a timeout would close the socket. It seems to
rather forget the socket.

>Still, it seems like most of the moving parts
>here are outside of Postgres' control --- I don't think that libpq
>itself has much involvement in the KDC communication.

Kerberos is weird. It goes into libgssapi, but libgssapi doesn't
do much on it's own, it just maps so-called "mech"s, which then point
to the actual kerberos code - which in the case of FreeBSD is very
ancient (but work should be underway to modernize it). It's one of
the most creepy pieces of code I've looked into.

> I concur with looking at the Discourse release notes and maybe asking
> some questions in that community.

They only support that app to run in a certain containerization
on a specific brand of Linux. They don't like my questions and
might just delete them.

Anyway, I have a lead now to either avoid the problem or where to
look more closely. And it has not directly to do with postgres, but
rather with genuine socket mishandling and/or maybe some flaw in
FreeBSD.

cheers,
PMc