Thread: libpq contention due to gss even when not using gss
Hi,

To investigate a report of both postgres and pgbouncer having issues when a
lot of new connections are established, I used pgbench -C. Oddly, on an early
attempt, the bottleneck wasn't postgres+pgbouncer, it was pgbench. But only
when using TCP, not with unix sockets.

c=40;pgbench -C -n -c$c -j$c -T5 -f <(echo 'select 1') 'port=6432 host=127.0.0.1 user=test dbname=postgres password=fake'

host=127.0.0.1:                     16465
host=127.0.0.1,gssencmode=disable   20860
host=/tmp:                          49286

Note that the server does *not* support gss, yet gss has a substantial
performance impact.

Obviously the connection rates here are absurdly high and, outside of badly
written applications, likely never practically relevant. However, the number
of cores in systems is going up, and this quite possibly will become relevant
in more realistic scenarios (lock contention kicks in earlier the more cores
you have). And it doesn't seem great that something as rarely used as gss
introduces overhead to very common paths.

Here's a bottom-up profile:

-   32.10%  pgbench  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
   - 32.09% queued_spin_lock_slowpath
      - 16.15% futex_wake
           do_futex
           __x64_sys_futex
           do_syscall_64
         - entry_SYSCALL_64_after_hwframe
            - 16.15% __GI___lll_lock_wake
               - __GI___pthread_mutex_unlock_usercnt
                  - 5.12% gssint_select_mech_type
                  - 4.36% gss_inquire_attrs_for_mech
                  - 2.85% gss_indicate_mechs
                     - gss_indicate_mechs_by_attrs
                  - 1.58% gss_acquire_cred_from
                       gss_acquire_cred
                       pg_GSS_have_cred_cache
                       select_next_encryption_method (inlined)
                       init_allowed_encryption_methods (inlined)
                       PQconnectPoll
                       pqConnectDBStart (inlined)
                       PQconnectStartParams
                       PQconnectdbParams
                       doConnect

Clearly the contention originates outside of our code, but is triggered by
doing pg_GSS_have_cred_cache() every time a connection is established.

Greetings,

Andres Freund
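For context, pg_GSS_have_cred_cache() is the libpq helper that probes the
GSSAPI library for a default client credential while libpq decides whether
GSS encryption is worth negotiating. A rough sketch of that probe (a
simplified paraphrase, not the exact libpq source; the helper name here is
illustrative) is below; the gss_acquire_cred() call is what descends into the
mechglue code and takes the mutexes visible in the profile above.

#include <stdbool.h>
#include <stddef.h>
#include <gssapi/gssapi.h>

/*
 * Simplified sketch of libpq's credential-cache probe: ask the GSSAPI
 * library for a default initiator credential and report whether one exists.
 * gss_acquire_cred() iterates over the installed mechanisms, which is where
 * the library-internal locking happens on every call.
 */
static bool
have_gss_cred_cache(gss_cred_id_t *cred_out)
{
	OM_uint32	major, minor;
	gss_cred_id_t cred = GSS_C_NO_CREDENTIAL;

	major = gss_acquire_cred(&minor, GSS_C_NO_NAME, 0, GSS_C_NO_OID_SET,
							 GSS_C_INITIATE, &cred, NULL, NULL);
	if (major != GSS_S_COMPLETE)
	{
		*cred_out = NULL;
		return false;
	}

	*cred_out = cred;
	return true;
}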
> On Mon, Jun 10, 2024 at 11:12:12AM GMT, Andres Freund wrote:
> Hi,
>
> To investigate a report of both postgres and pgbouncer having issues when a
> lot of new connections are established, I used pgbench -C. Oddly, on an
> early attempt, the bottleneck wasn't postgres+pgbouncer, it was pgbench. But
> only when using TCP, not with unix sockets.
>
> c=40;pgbench -C -n -c$c -j$c -T5 -f <(echo 'select 1') 'port=6432 host=127.0.0.1 user=test dbname=postgres password=fake'
>
> host=127.0.0.1:                     16465
> host=127.0.0.1,gssencmode=disable   20860
> host=/tmp:                          49286
>
> Note that the server does *not* support gss, yet gss has a substantial
> performance impact.
>
> Obviously the connection rates here are absurdly high and, outside of badly
> written applications, likely never practically relevant. However, the number
> of cores in systems is going up, and this quite possibly will become relevant
> in more realistic scenarios (lock contention kicks in earlier the more cores
> you have).

By not supporting gss I assume you mean having built with --with-gssapi,
but only host (not hostgssenc) records in pg_hba, right?
Hi,

On 2024-06-13 17:33:57 +0200, Dmitry Dolgov wrote:
> > On Mon, Jun 10, 2024 at 11:12:12AM GMT, Andres Freund wrote:
> > Hi,
> >
> > To investigate a report of both postgres and pgbouncer having issues when a
> > lot of new connections are established, I used pgbench -C. Oddly, on an
> > early attempt, the bottleneck wasn't postgres+pgbouncer, it was pgbench. But
> > only when using TCP, not with unix sockets.
> >
> > c=40;pgbench -C -n -c$c -j$c -T5 -f <(echo 'select 1') 'port=6432 host=127.0.0.1 user=test dbname=postgres password=fake'
> >
> > host=127.0.0.1:                     16465
> > host=127.0.0.1,gssencmode=disable   20860
> > host=/tmp:                          49286
> >
> > Note that the server does *not* support gss, yet gss has a substantial
> > performance impact.
> >
> > Obviously the connection rates here are absurdly high and, outside of badly
> > written applications, likely never practically relevant. However, the number
> > of cores in systems is going up, and this quite possibly will become relevant
> > in more realistic scenarios (lock contention kicks in earlier the more cores
> > you have).
>
> By not supporting gss I assume you mean having built with --with-gssapi,
> but only host (not hostgssenc) records in pg_hba, right?

Yes, the latter. Or not having kerberos set up on the client side.

Greetings,

Andres Freund
> On Thu, Jun 13, 2024 at 10:30:24AM GMT, Andres Freund wrote:
> > > To investigate a report of both postgres and pgbouncer having issues when a
> > > lot of new connections are established, I used pgbench -C. Oddly, on an
> > > early attempt, the bottleneck wasn't postgres+pgbouncer, it was pgbench. But
> > > only when using TCP, not with unix sockets.
> > >
> > > c=40;pgbench -C -n -c$c -j$c -T5 -f <(echo 'select 1') 'port=6432 host=127.0.0.1 user=test dbname=postgres password=fake'
> > >
> > > host=127.0.0.1:                     16465
> > > host=127.0.0.1,gssencmode=disable   20860
> > > host=/tmp:                          49286
> > >
> > > Note that the server does *not* support gss, yet gss has a substantial
> > > performance impact.
> > >
> > > Obviously the connection rates here are absurdly high and, outside of badly
> > > written applications, likely never practically relevant. However, the number
> > > of cores in systems is going up, and this quite possibly will become relevant
> > > in more realistic scenarios (lock contention kicks in earlier the more cores
> > > you have).
> >
> > By not supporting gss I assume you mean having built with --with-gssapi,
> > but only host (not hostgssenc) records in pg_hba, right?
>
> Yes, the latter. Or not having kerberos set up on the client side.

I've been experimenting with both:

* The server is built without gssapi, but the client does support it.
  This produces exactly the contention you're talking about.

* The server is built with gssapi but does not use it in pg_hba, and the
  client does support gssapi. In this case the difference between
  gssencmode=disable/prefer is even more dramatic in my test case
  (milliseconds vs seconds), because the environment has kerberos configured
  (for other purposes), so gss_init_sec_context spends a huge amount of time
  only to still return nothing.

At the same time, after a quick look I don't see an easy way to avoid that.
The current implementation tries to initialize gss before getting any
confirmation from the server about whether it's supported. Doing it the
other way around would probably just shift the overhead to the server side.
> On 14 Jun 2024, at 10:46, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
>> On Thu, Jun 13, 2024 at 10:30:24AM GMT, Andres Freund wrote:
>>>> To investigate a report of both postgres and pgbouncer having issues when a
>>>> lot of new connections are established, I used pgbench -C. Oddly, on an
>>>> early attempt, the bottleneck wasn't postgres+pgbouncer, it was pgbench. But
>>>> only when using TCP, not with unix sockets.
>>>>
>>>> c=40;pgbench -C -n -c$c -j$c -T5 -f <(echo 'select 1') 'port=6432 host=127.0.0.1 user=test dbname=postgres password=fake'
>>>>
>>>> host=127.0.0.1:                     16465
>>>> host=127.0.0.1,gssencmode=disable   20860
>>>> host=/tmp:                          49286
>>>>
>>>> Note that the server does *not* support gss, yet gss has a substantial
>>>> performance impact.
>>>>
>>>> Obviously the connection rates here are absurdly high and, outside of badly
>>>> written applications, likely never practically relevant. However, the number
>>>> of cores in systems is going up, and this quite possibly will become relevant
>>>> in more realistic scenarios (lock contention kicks in earlier the more cores
>>>> you have).
>>>
>>> By not supporting gss I assume you mean having built with --with-gssapi,
>>> but only host (not hostgssenc) records in pg_hba, right?
>>
>> Yes, the latter. Or not having kerberos set up on the client side.
>
> I've been experimenting with both:
>
> * The server is built without gssapi, but the client does support it.
>   This produces exactly the contention you're talking about.
>
> * The server is built with gssapi but does not use it in pg_hba, and the
>   client does support gssapi. In this case the difference between
>   gssencmode=disable/prefer is even more dramatic in my test case
>   (milliseconds vs seconds), because the environment has kerberos configured
>   (for other purposes), so gss_init_sec_context spends a huge amount of time
>   only to still return nothing.
>
> At the same time, after a quick look I don't see an easy way to avoid that.
> The current implementation tries to initialize gss before getting any
> confirmation from the server about whether it's supported. Doing it the
> other way around would probably just shift the overhead to the server side.

The main problem seems to be that we check whether or not there is a
credential cache when we try to select encryption but not yet authentication,
as a way to figure out if gssenc is at all worth trying?  I experimented with
deferring it with potentially cheaper heuristics in encryption selection, but
it seems hard to get around since other methods were even more expensive.

--
Daniel Gustafsson
> On Fri, Jun 14, 2024 at 12:12:55PM GMT, Daniel Gustafsson wrote:
> > I've been experimenting with both:
> >
> > * The server is built without gssapi, but the client does support it.
> >   This produces exactly the contention you're talking about.
> >
> > * The server is built with gssapi but does not use it in pg_hba, and the
> >   client does support gssapi. In this case the difference between
> >   gssencmode=disable/prefer is even more dramatic in my test case
> >   (milliseconds vs seconds), because the environment has kerberos configured
> >   (for other purposes), so gss_init_sec_context spends a huge amount of time
> >   only to still return nothing.
> >
> > At the same time, after a quick look I don't see an easy way to avoid that.
> > The current implementation tries to initialize gss before getting any
> > confirmation from the server about whether it's supported. Doing it the
> > other way around would probably just shift the overhead to the server side.
>
> The main problem seems to be that we check whether or not there is a
> credential cache when we try to select encryption but not yet authentication,
> as a way to figure out if gssenc is at all worth trying?

Yep, this is my understanding as well. Which other methods did you try for
checking that?
Hi,

On 2024-06-14 10:46:04 +0200, Dmitry Dolgov wrote:
> At the same time, after a quick look I don't see an easy way to avoid that.
> The current implementation tries to initialize gss before getting any
> confirmation from the server about whether it's supported. Doing it the
> other way around would probably just shift the overhead to the server side.

Initializing the gss cache at all isn't so much the problem. It's that we do
it for every connection. And that doing so requires locking inside gss. So
maybe we could just globally cache that gss isn't available, instead of
rediscovering it over and over for every new connection.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> Initializing the gss cache at all isn't so much the problem. It's that we do
> it for every connection. And that doing so requires locking inside gss. So
> maybe we could just globally cache that gss isn't available, instead of
> rediscovering it over and over for every new connection.

I had the impression that krb5 already had such a cache internally.
Maybe they don't cache the "failed" state though. I doubt we'd want to
either in long-lived processes --- what if the user installs the
credential while we're running?

			regards, tom lane
Hi,

On 2024-06-14 12:27:12 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Initializing the gss cache at all isn't so much the problem. It's that we do
> > it for every connection. And that doing so requires locking inside gss. So
> > maybe we could just globally cache that gss isn't available, instead of
> > rediscovering it over and over for every new connection.
>
> I had the impression that krb5 already had such a cache internally.

Well, if so, it clearly doesn't seem to work very well, given that it causes
contention at ~15k lookups/sec. That's obviously a trivial number for
anything cached, even with the worst possible locking regimen.

> Maybe they don't cache the "failed" state though. I doubt we'd want to
> either in long-lived processes --- what if the user installs the
> credential while we're running?

If we can come up with something better - cool. But it doesn't seem great
that gss introduces contention for the vast majority of folks that use libpq
in environments that never use gss.

I don't think we should cache the set of credentials when gss is actually
available on a process-wide basis, just the fact that gss isn't available at
all. I think it's very unlikely for that fact to change while an application
is running. And if it happens, requiring a restart in those cases seems an
acceptable price to pay for what is effectively a niche feature.

Greetings,

Andres Freund
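A minimal sketch of the idea discussed above, caching only the negative
result process-wide; all names here are hypothetical, this is not actual
libpq code, and it glosses over thread-safety of the flag as well as the
"credential installed while running" caveat raised in this thread:

#include <stdbool.h>
#include <stddef.h>
#include <gssapi/gssapi.h>

/*
 * Illustrative sketch only: remember, process-wide, that no GSS credential
 * cache is available, so that later connection attempts skip the GSSAPI
 * library (and its internal locking) entirely.  Real code would want an
 * atomic or once-style guard for the flag in multithreaded programs.
 */
static bool gss_creds_known_absent = false;

static bool
have_gss_cred_cache_cached(gss_cred_id_t *cred_out)
{
	OM_uint32	major, minor;
	gss_cred_id_t cred = GSS_C_NO_CREDENTIAL;

	*cred_out = NULL;

	/* Fast path: we already learned there is no credential cache. */
	if (gss_creds_known_absent)
		return false;

	major = gss_acquire_cred(&minor, GSS_C_NO_NAME, 0, GSS_C_NO_OID_SET,
							 GSS_C_INITIATE, &cred, NULL, NULL);
	if (major != GSS_S_COMPLETE)
	{
		/*
		 * Cache only the failure.  As noted above, a credential installed
		 * later would then require restarting the application to be
		 * picked up.
		 */
		gss_creds_known_absent = true;
		return false;
	}

	*cred_out = cred;
	return true;
}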