Thread: Database logins taking longer and longer, showing up as "authentication" in ps(1)

  Hi,


We have a PostgreSQL 11.3¹ running on an Ubuntu 16.04 server, which
sometimes exhibits a behaviour I can't quite understand: simply logging
into the database starts to take minutes to complete.

We have 60 processes (workers) running on different machines accessing
the database, that all grab jobs from a queue and update rows in a table
after doing some calculations (which vary in time from <1s to perhaps a
minute, many of them fast).

Sometimes new database logins slow down, from usually taking <0.05s to
taking minutes. This is for psql as a normal user using Kerberos, for
psql as the postgres superuser, for the web-application logging into the
database, for everything.

What I see in the output of ps(1) when this problem occurs is a lot of
postgres processes in the "authentication" state. It "feels" like the
logins are waiting for something - when e.g. psql finally logs in,
running queries seem normal.

The only "special" thing I can think of is that the workers use advisory
locks on rows in the table they are updating (as there can be multiple
jobs in flight at the same time, for the same row).
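
Roughly, each worker brackets its update like this (the table name,
column and lock key are made up for illustration, not our actual
schema):

  -- only one in-flight job per calculations row; 12345 stands in for the row id
  SELECT pg_advisory_lock(12345);
  UPDATE calculations SET result = '...' WHERE id = 12345;
  SELECT pg_advisory_unlock(12345);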

I can imagine that a pattern of jobs and workers could make the locking
cause those workers to wait and slow each other down, but I don't
understand how it can affect the time it takes for new database logins.

When this happens, if I stop the 60 workers, the login times are back to
normal within seconds, and all the postgres processes showing
"authentication" switch away from that state. If I start the workers
again, the login problem appears within a couple of minutes.

If I start a smaller number of workers, say 5, the problem doesn't
appear.

Could this be lock contention on a database table somehow affecting
logins?

I have tried searching for this problem, and found some older references
to things² that seem to have been fixed in the meantime. Is anybody else
seeing this?

The server is configured with max_connections 1200 and currently has
~400 connections. It has 1.5 TB of memory and 112 cores (including
hyperthreading) - the load is not high when I see this problem, between
20 and 30.

Any ideas and suggestions are welcome - if there is some relevant
information I can provide, I will be happy to try.


  Best regards,

    Adam


¹ We should plan to upgrade to 11.9, yes.

² E.g. https://www.postgresql.org/message-id/CAJw8uJQGoobesbPCMbxj6Vb4nv9D-GgvZ%2B7pK%2Bfckbb4DqJEAg%40mail.gmail.com
   and https://www.postgresql.org/message-id/20140312232810.634.7997%40wrigleys.postgresql.org

-- 
 "We don't care about crashing our cars                     Adam Sjøgren
  Pedal to the metal and our heads among the stars"    asjo@koldfront.dk




Greetings,

* Adam Sjøgren (asjo@koldfront.dk) wrote:
> We have a PostgreSQL 11.3¹ running on an Ubuntu 16.04 server, which
> sometimes exhibits a behaviour I can't quite understand: simply logging
> into the database starts to take minutes to complete.
>
> We have 60 processes (workers) running on different machines accessing
> the database, that all grab jobs from a queue and update rows in a table
> after doing some calculations (which vary in time from <1s to perhaps a
> minute, many of them fast).
>
> Sometimes new database logins slow down, from usually taking <0.05s to
> taking minutes. This is for psql as a normal user using Kerberos, for
> psql as the postgres superuser, for the web-application logging into the
> database, for everything.

When in doubt, blame DNS.

Alternatively, in your case, the issue might be the KDC taking forever
to issue a ticket for the service.

Note that, in both cases, this is a client-side issue.  Once the ticket
has been acquired for the PG service, the actual authentication on the
server side should be very fast and not depend on things external
(though you might check if you have log_hostname on...).

Thanks,

Stephen

Laurenz writes:

> On Tue, 2020-08-18 at 14:53 +0200, Adam Sjøgren wrote:
>>
>> We have 60 processes (workers) running on different machines accessing
>> the database, that all grab jobs from a queue and update rows in a table
>> after doing some calculations (which vary in time from <1s to perhaps a
>> minute, many of them fast).
>>
>> Sometimes new database logins slow down, from usually taking <0.05s to
>> taking minutes. This is for psql as a normal user using Kerberos, for
>> psql as the postgres superuser, for the web-application logging into the
>> database, for everything.
>
> I would suspect that the problem is with the authentication method.

Interesting, how are you suspecting the interaction to be playing out?

> How is authentication configured?

Primarily using gss (for the web-application, delegated credentials),
md5 (system users), and a few trust entries (for legacy access).

> I'd "strace" a connection that is hanging in "authenticating" and
> see what the process does.

I did that one of the previous times I have seen this problem, and saw
stuff like this:

  2020-06-10 11:22:14    strace: Process 101559 attached
  2020-06-10 11:22:14    futex(0x7f46df29e138, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff) = 0
    ...
  2020-06-10 11:24:09    futex(0x7f46df29e138, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff) = 0
  2020-06-10 11:24:09    futex(0x7f46df2a2e38, FUTEX_WAKE, 1)    = 1
  2020-06-10 11:24:09    brk(0x55c197084000)                     = 0x55c197084000
  2020-06-10 11:24:09    brk(0x55c1970b0000)                     = 0x55c1970b0000
  2020-06-10 11:24:09    open("global/pg_filenode.map", O_RDONLY) = 6

(where the elided part is futex() calls).

The slow login is regardless of whether I login with Kerberos or as the
postgres superuser with psql.


  Best regards,

    Adam

--
 "Sometimes I feel like going home, walking in the          Adam Sjøgren
  boardroom                                            asjo@koldfront.dk
  Sometimes I feel like I wanna go, walk into the
  ballroom"



Stephen Frost <sfrost@snowman.net> writes:
> * Adam Sjøgren (asjo@koldfront.dk) wrote:
>> Sometimes new database logins slow down, from usually taking <0.05s to
>> taking minutes. This is for psql as a normal user using Kerberos, for
>> psql as the postgres superuser, for the web-application logging into the
>> database, for everything.

> When in doubt, blame DNS.

The trouble with outside-the-database explanations is that it's not
clear why stopping the worker processes would make such issues go away.

I think looking into pg_stat_activity and pg_locks might be interesting,
although it's possible the delay is before the new session has made
any entries there.
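
For instance, something along these lines (just a sketch, adjust as
needed) should show what, if anything, the stuck backends are reported
as waiting on:

  SELECT pid, state, wait_event_type, wait_event, backend_start,
         left(query, 60) AS query
    FROM pg_stat_activity
   WHERE wait_event IS NOT NULL
   ORDER BY backend_start;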

Also, obtaining stack traces from a few of the stuck processes would
likely be quite informative.

            regards, tom lane



Stephen writes:

> * Adam Sjøgren (asjo@koldfront.dk) wrote:
>>
>> We have 60 processes (workers) running on different machines accessing
>> the database, that all grab jobs from a queue and update rows in a table
>> after doing some calculations (which vary in time from <1s to perhaps a
>> minute, many of them fast).
>>
>> Sometimes new database logins slow down, from usually taking <0.05s to
>> taking minutes. This is for psql as a normal user using Kerberos, for
>> psql as the postgres superuser, for the web-application logging into the
>> database, for everything.
>
> When in doubt, blame DNS.

I'd love to! However I don't see any DNS outages on our local network
correlating with whether I run 60 workers or 5 workers.

> Alternatively, in your case, the issue might be the KDC taking forever
> to issue a ticket for the service.

If that was the cause, logging in as the 'postgres' superuser (not using
Kerberos) locally on the server should be fast regardless, right?

> (though you might check if you have log_hostname on...).

It's off:

  $ grep hostname /etc/postgresql/11/main/postgresql.conf
  #log_hostname = off
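
For completeness, the live value (in case it were set somewhere other
than that file) can be checked from psql as well:

  -- shows the effective setting, regardless of where it was set
  SHOW log_hostname;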


Thanks for replying!


  Best regards,

    Adam

--
 "Probabilistic algorithms don't appeal to me. (This        Adam Sjøgren
  is a question of aesthetics, not practicality.) So   asjo@koldfront.dk
  later, I figured out how to remove the probability
  and turn it into a deterministic algorithm."





On Tue, Aug 18, 2020, 8:04 PM Stephen Frost <sfrost@snowman.net> wrote:
> Greetings,
>
> * Adam Sjøgren (asjo@koldfront.dk) wrote:
>> We have a PostgreSQL 11.3¹ running on an Ubuntu 16.04 server, which
>> sometimes exhibits a behaviour I can't quite understand: simply logging
>> into the database starts to take minutes to complete.
>>
>> We have 60 processes (workers) running on different machines accessing
>> the database, that all grab jobs from a queue and update rows in a table
>> after doing some calculations (which vary in time from <1s to perhaps a
>> minute, many of them fast).
>>
>> Sometimes new database logins slow down, from usually taking <0.05s to
>> taking minutes. This is for psql as a normal user using Kerberos, for
>> psql as the postgres superuser, for the web-application logging into the
>> database, for everything.
>
> When in doubt, blame DNS.
>
> Alternatively, in your case, the issue might be the KDC taking forever
> to issue a ticket for the service.
>
> Note that, in both cases, this is a client-side issue.

If it is a client-side issue, how do you explain that the OP is not
seeing the problem when there are only a few clients?

Regards,
Seenu.

> Once the ticket
> has been acquired for the PG service, the actual authentication on the
> server side should be very fast and not depend on things external
> (though you might check if you have log_hostname on...).
>
> Thanks,
>
> Stephen
Tom writes:

> Stephen Frost <sfrost@snowman.net> writes:
>> * Adam Sjøgren (asjo@koldfront.dk) wrote:
>>> Sometimes new database logins slow down, from usually taking <0.05s to
>>> taking minutes. This is for psql as a normal user using Kerberos, for
>>> psql as the postgres superuser, for the web-application logging into the
>>> database, for everything.
>
> I think looking into pg_stat_activity and pg_locks might be interesting,
> although it's possible the delay is before the new session has made
> any entries there.
>
> Also, obtaining stack traces from a few of the stuck processes would
> likely be quite informative.

I will try to look at pg_stat_activity and pg_locks the next time - it
just happened now and I focused on getting some stacktraces -
unfortunately I do not have the -dbg package installed, so I don't know
how useful they are.

Here's the first one:

  $ ps faux | grep postgres | grep authentication
  postgres  16697  0.0  0.0 34686088 8352 ?       Ss   14:45   0:00  \_ postgres: 11/main: workqueue XXdb 127.0.0.1(42534) authentication
  postgres  16701  0.0  0.0 34686088 6476 ?       Ss   14:45   0:00  \_ postgres: 11/main: anonymous XXdb 127.0.0.1(42536) authentication
  postgres  16702  0.0  0.0 34686088 8348 ?       Ss   14:45   0:00  \_ postgres: 11/main: svc_user XXdb 127.0.0.1(42540) authentication
  postgres  16704  0.0  0.0 34686040 5376 ?       Ss   14:45   0:00  \_ postgres: 11/main: postgres postgres [local] authentication
  postgres  16705  0.0  0.0 34686088 8352 ?       Ss   14:45   0:00  \_ postgres: 11/main: svc_user XXdb 127.0.0.1(42542) authentication
  postgres  16708  0.0  0.0 34686088 6476 ?       Ss   14:45   0:00  \_ postgres: 11/main: svc_user XXdb 127.0.0.1(42544) authentication
  $ sudo gdb -p  16708
  GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
    [...]
  Attaching to process 16708
  Reading symbols from /usr/lib/postgresql/11/bin/postgres...(no debugging symbols found)...done.
    [...]
  0x00007f8270a039f3 in delete_module () at ../sysdeps/unix/syscall-template.S:86
  86    ../sysdeps/unix/syscall-template.S: No such file or directory.
  (gdb) bt
  #0  0x00007f8270a039f3 in delete_module () at ../sysdeps/unix/syscall-template.S:86
  #1  0x000055f60d3d0101 in WaitEventSetWait ()
  #2  0x000055f60d2dccff in secure_read ()
  #3  0x000055f60d2e9138 in ?? ()
  #4  0x000055f60d2e9e75 in pq_getbyte ()
  #5  0x000055f60d3f3100 in PostgresMain ()
  #6  0x000055f60d102bf8 in ?? ()
  #7  0x000055f60d37e492 in PostmasterMain ()
  #8  0x000055f60d103fa5 in main ()
  (gdb)

I think I was too slow on this one, and this is what it looks like if
the process is just waiting for a connection (as you can guess, I'm not
used to looking at backtraces).

Here is another one, which hopefully is caught at a better moment, and
more useful:

  postgres  19495  0.0  0.0 34686088 8412 ?       Ss   14:48   0:00  \_ postgres: 11/main: workqueue XXdb 127.0.0.1(43390) authentication
  postgres  19497  0.0  0.0 34686088 6476 ?       Ss   14:48   0:00  \_ postgres: 11/main: XXX XXdb 127.0.0.1(43392) authentication
  postgres  19502  0.0  0.0 34686088 8412 ?       Ss   14:48   0:00  \_ postgres: 11/main: XXX XXdb 127.0.0.1(43394) authentication
  $ sudo gdb -p  19495
  GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
    [...]
  Attaching to process 19495
    [...]
  0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f1d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  205    ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
  (gdb) bt
  #0  0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f1d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  #1  do_futex_wait (sem=sem@entry=0x7f7a2e9f1d38, abstime=0x0) at sem_waitcommon.c:111
  #2  0x00007f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f1d38, abstime=0x0) at sem_waitcommon.c:181
  #3  0x00007f8272cf297a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29
  #4  0x000055f60d36c252 in PGSemaphoreLock ()
  #5  0x000055f60d3e35cc in LWLockAcquire ()
  #6  0x000055f60d3d14ee in ProcArrayAdd ()
  #7  0x000055f60d3e01a3 in InitProcessPhase2 ()
  #8  0x000055f60d51c99b in InitPostgres ()
  #9  0x000055f60d3f2baf in PostgresMain ()
  #10 0x000055f60d102bf8 in ?? ()
  #11 0x000055f60d37e492 in PostmasterMain ()
  #12 0x000055f60d103fa5 in main ()
  (gdb) cont
  Continuing.
  ^C
  Program received signal SIGINT, Interrupt.
  0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f1d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  205    in ../sysdeps/unix/sysv/linux/futex-internal.h
  (gdb) bt
  #0  0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f1d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  #1  do_futex_wait (sem=sem@entry=0x7f7a2e9f1d38, abstime=0x0) at sem_waitcommon.c:111
  #2  0x00007f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f1d38, abstime=0x0) at sem_waitcommon.c:181
  #3  0x00007f8272cf297a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29
  #4  0x000055f60d36c252 in PGSemaphoreLock ()
  #5  0x000055f60d3e35cc in LWLockAcquire ()
  #6  0x000055f60d3d14ee in ProcArrayAdd ()
  #7  0x000055f60d3e01a3 in InitProcessPhase2 ()
  #8  0x000055f60d51c99b in InitPostgres ()
  #9  0x000055f60d3f2baf in PostgresMain ()
  #10 0x000055f60d102bf8 in ?? ()
  #11 0x000055f60d37e492 in PostmasterMain ()
  #12 0x000055f60d103fa5 in main ()
  (gdb) cont
  Continuing.
  (gdb) bt

  Program received signal SIGUSR1, User defined signal 1.
  0x00007f8270a039f3 in delete_module () at ../sysdeps/unix/syscall-template.S:86
  86    ../sysdeps/unix/syscall-template.S: No such file or directory.
  (gdb) bt
  #0  0x00007f8270a039f3 in delete_module () at ../sysdeps/unix/syscall-template.S:86
  #1  0x000055f60d3d0101 in WaitEventSetWait ()
  #2  0x000055f60d3d0574 in WaitLatchOrSocket ()
  #3  0x000055f60d3e0b41 in ProcSleep ()
  #4  0x000055f60d3db249 in ?? ()
  #5  0x000055f60d3dc591 in LockAcquireExtended ()
  #6  0x000055f60d3da365 in LockTuple ()
  #7  0x000055f60d144ea6 in heap_update ()
  #8  0x000055f60d2c59ff in ?? ()
  #9  0x000055f60d2c609b in ?? ()
  #10 0x000055f60d2a0d8b in standard_ExecutorRun ()
  #11 0x00007f8269670e45 in ?? () from /usr/lib/postgresql/11/lib/pg_stat_statements.so
  ---Type <return> to continue, or q <return> to quit---
  #12 0x000055f60d3f5cd5 in ?? ()
  #13 0x000055f60d3f5f28 in ?? ()
  #14 0x000055f60d3f6be5 in PortalRun ()
  #15 0x000055f60d3f26a1 in ?? ()
  #16 0x000055f60d3f3bcb in PostgresMain ()
  #17 0x000055f60d102bf8 in ?? ()
  #18 0x000055f60d37e492 in PostmasterMain ()
  #19 0x000055f60d103fa5 in main ()
  (gdb) cont
  Continuing.
  (gdb) bt
  (gdb) cont
  (gdb) cont
  ^C
  Program received signal SIGINT, Interrupt.
  0x00007f8270a039f3 in delete_module () at ../sysdeps/unix/syscall-template.S:86
  86    in ../sysdeps/unix/syscall-template.S
  (gdb) bt
  #0  0x00007f8270a039f3 in delete_module () at ../sysdeps/unix/syscall-template.S:86
  #1  0x000055f60d3d0101 in WaitEventSetWait ()
  #2  0x000055f60d2dccff in secure_read ()
  #3  0x000055f60d2e9138 in ?? ()
  #4  0x000055f60d2e9e75 in pq_getbyte ()
  #5  0x000055f60d3f3100 in PostgresMain ()
  #6  0x000055f60d102bf8 in ?? ()
  #7  0x000055f60d37e492 in PostmasterMain ()
  #8  0x000055f60d103fa5 in main ()
  (gdb) cont
  Continuing.
  ^C
  Program received signal SIGINT, Interrupt.
  0x00007f8270a039f3 in delete_module () at ../sysdeps/unix/syscall-template.S:86
  86    in ../sysdeps/unix/syscall-template.S
  (gdb) bt
  #0  0x00007f8270a039f3 in delete_module () at ../sysdeps/unix/syscall-template.S:86
  #1  0x000055f60d3d0101 in WaitEventSetWait ()
  #2  0x000055f60d2dccff in secure_read ()
  #3  0x000055f60d2e9138 in ?? ()
  #4  0x000055f60d2e9e75 in pq_getbyte ()
  #5  0x000055f60d3f3100 in PostgresMain ()
  #6  0x000055f60d102bf8 in ?? ()
  #7  0x000055f60d37e492 in PostmasterMain ()
  #8  0x000055f60d103fa5 in main ()
  (gdb) quit
  A debugging session is active.

          Inferior 1 [process 19495] will be detached.

  Quit anyway? (y or n) n
  Not confirmed.
  (gdb) detach
  Detaching from program: /usr/lib/postgresql/11/bin/postgres, process 19495
  (gdb) quit

Is this helpful, or should I do something different?

Here is a final one:

  postgres  21257  0.0  0.0 34686088 6476 ?       Ss   14:52   0:00  \_ postgres: 11/main: svc_user XXdb 127.0.0.1(44192) authentication
  $ sudo gdb -p 21257
    [...]
  0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2ea001b8) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  205    ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
  (gdb) bt
  #0  0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2ea001b8) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  #1  do_futex_wait (sem=sem@entry=0x7f7a2ea001b8, abstime=0x0) at sem_waitcommon.c:111
  #2  0x00007f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2ea001b8, abstime=0x0) at sem_waitcommon.c:181
  #3  0x00007f8272cf297a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29
  #4  0x000055f60d36c252 in PGSemaphoreLock ()
  #5  0x000055f60d3e35cc in LWLockAcquire ()
  #6  0x000055f60d3d14ee in ProcArrayAdd ()
  #7  0x000055f60d3e01a3 in InitProcessPhase2 ()
  #8  0x000055f60d51c99b in InitPostgres ()
  #9  0x000055f60d3f2baf in PostgresMain ()
  #10 0x000055f60d102bf8 in ?? ()
  #11 0x000055f60d37e492 in PostmasterMain ()
  #12 0x000055f60d103fa5 in main ()
  (gdb) cont
  Continuing.
  bt

  Program received signal SIGUSR1, User defined signal 1.
  0x000055f60d10e50e in slot_getsomeattrs ()
  (gdb) bt
  #0  0x000055f60d10e50e in slot_getsomeattrs ()
  #1  0x000055f60d29d260 in ?? ()
  #2  0x000055f60d2a9965 in ExecScan ()
  #3  0x000055f60d2b7b2a in ?? ()
  #4  0x000055f60d2c71e6 in ?? ()
  #5  0x000055f60d2a0d8b in standard_ExecutorRun ()
  #6  0x00007f8269670e45 in ?? () from /usr/lib/postgresql/11/lib/pg_stat_statements.so
  #7  0x000055f60d3f571b in ?? ()
  #8  0x000055f60d3f6d40 in PortalRun ()
  #9  0x000055f60d3f26a1 in ?? ()
  #10 0x000055f60d3f3bcb in PostgresMain ()
  #11 0x000055f60d102bf8 in ?? ()
  #12 0x000055f60d37e492 in PostmasterMain ()
  #13 0x000055f60d103fa5 in main ()
  (gdb) cont
  Continuing.

  Program received signal SIGUSR1, User defined signal 1.
  0x000055f60d10e50e in slot_getsomeattrs ()
  (gdb) bt
  #0  0x000055f60d10e50e in slot_getsomeattrs ()
  #1  0x000055f60d29d260 in ?? ()
  #2  0x000055f60d2a9965 in ExecScan ()
  #3  0x000055f60d2b7b2a in ?? ()
  #4  0x000055f60d2c71e6 in ?? ()
  #5  0x000055f60d2a0d8b in standard_ExecutorRun ()
  #6  0x00007f8269670e45 in ?? () from /usr/lib/postgresql/11/lib/pg_stat_statements.so
  #7  0x000055f60d3f571b in ?? ()
  #8  0x000055f60d3f6d40 in PortalRun ()
  #9  0x000055f60d3f26a1 in ?? ()
  #10 0x000055f60d3f3bcb in PostgresMain ()
  #11 0x000055f60d102bf8 in ?? ()
  #12 0x000055f60d37e492 in PostmasterMain ()
  #13 0x000055f60d103fa5 in main ()
  (gdb) cont
  Continuing.
  (gdb) bt
  [Inferior 1 (process 21257) exited normally]
  (gdb) bt
  No stack.
  (gdb) cont
  The program is not being run.
  (gdb) quit

Let me know if there is something else I should try.


  Best regards,

    Adam

--
 "It is a sort of cheap and cheerful kind of                Adam Sjøgren
  abstraction, but it works well in practise."         asjo@koldfront.dk



Adam writes:

> I will try to look at pg_stat_activity and pg_locks the next time

I did not have to wait long, so I have captured the output of
pg_stat_activity and pg_locks when everything is fine, by running

  time sudo -u postgres PAGER=cat psql template1 -c 'SELECT * FROM pg_stat_activity'
  time sudo -u postgres PAGER=cat psql template1 -c 'SELECT * FROM pg_locks'

and the result is here:

 · https://koldfront.dk/misc/postgresql/pg_stat_activity_pg_locks-OK.txt.gz

And here is the same capture while it was bad, just before I shut down
the workers (notice how the first psql took 23s, compared to 0.112s when
everything was fine):

 · https://koldfront.dk/misc/postgresql/pg_stat_activity_pg_locks-BAD.txt.gz

(I have "sanitized" both files for usernames, application names,
IP-addresses and queries.)

I also attached gdb to one of the processes and got a backtrace, let it
continue, pressed ^C after ~5s, got another backtrace, and kept doing
that for a while; this is the result:

  postgres  79572  0.0  0.0 34686088 8412 ?       Ss   15:15   0:00  \_ postgres: 11/main: workqueue XXdb 127.0.0.1(53126) authentication
  postgres  79573  0.0  0.0 34686088 6476 ?       Ss   15:15   0:00  \_ postgres: 11/main: anonymous XXdb 127.0.0.1(53128) authentication
  $ sudo gdb -p 79572
  GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
     [...]
  Attaching to process 79572
     [...]
  Reading symbols from /usr/lib/postgresql/11/bin/postgres...(no debugging symbols found)...done.
     [...]
  0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  205    ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
  (gdb) bt
  #0  0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  #1  do_futex_wait (sem=sem@entry=0x7f7a2e9f6d38, abstime=0x0) at sem_waitcommon.c:111
  #2  0x00007f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f6d38, abstime=0x0) at sem_waitcommon.c:181
  #3  0x00007f8272cf297a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29
  #4  0x000055f60d36c252 in PGSemaphoreLock ()
  #5  0x000055f60d3e35cc in LWLockAcquire ()
  #6  0x000055f60d3d14ee in ProcArrayAdd ()
  #7  0x000055f60d3e01a3 in InitProcessPhase2 ()
  #8  0x000055f60d51c99b in InitPostgres ()
  #9  0x000055f60d3f2baf in PostgresMain ()
  #10 0x000055f60d102bf8 in ?? ()
  #11 0x000055f60d37e492 in PostmasterMain ()
  #12 0x000055f60d103fa5 in main ()
  (gdb) cont
  Continuing.
  ^C
  Program received signal SIGINT, Interrupt.
  0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  205    in ../sysdeps/unix/sysv/linux/futex-internal.h
  (gdb) bt
  #0  0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  #1  do_futex_wait (sem=sem@entry=0x7f7a2e9f6d38, abstime=0x0) at sem_waitcommon.c:111
  #2  0x00007f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f6d38, abstime=0x0) at sem_waitcommon.c:181
  #3  0x00007f8272cf297a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29
  #4  0x000055f60d36c252 in PGSemaphoreLock ()
  #5  0x000055f60d3e35cc in LWLockAcquire ()
  #6  0x000055f60d3d14ee in ProcArrayAdd ()
  #7  0x000055f60d3e01a3 in InitProcessPhase2 ()
  #8  0x000055f60d51c99b in InitPostgres ()
  #9  0x000055f60d3f2baf in PostgresMain ()
  #10 0x000055f60d102bf8 in ?? ()
  #11 0x000055f60d37e492 in PostmasterMain ()
  #12 0x000055f60d103fa5 in main ()
  (gdb) cont
  Continuing.
  ^C
  Program received signal SIGINT, Interrupt.
  0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  205    in ../sysdeps/unix/sysv/linux/futex-internal.h
  (gdb) bt
  #0  0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  #1  do_futex_wait (sem=sem@entry=0x7f7a2e9f6d38, abstime=0x0) at sem_waitcommon.c:111
  #2  0x00007f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f6d38, abstime=0x0) at sem_waitcommon.c:181
  #3  0x00007f8272cf297a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29
  #4  0x000055f60d36c252 in PGSemaphoreLock ()
  #5  0x000055f60d3e35cc in LWLockAcquire ()
  #6  0x000055f60d3d14ee in ProcArrayAdd ()
  #7  0x000055f60d3e01a3 in InitProcessPhase2 ()
  #8  0x000055f60d51c99b in InitPostgres ()
  #9  0x000055f60d3f2baf in PostgresMain ()
  #10 0x000055f60d102bf8 in ?? ()
  #11 0x000055f60d37e492 in PostmasterMain ()
  #12 0x000055f60d103fa5 in main ()
  (gdb) cont
  Continuing.
  ^C
  Program received signal SIGINT, Interrupt.
  0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  205    in ../sysdeps/unix/sysv/linux/futex-internal.h
  (gdb) bt
  #0  0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  #1  do_futex_wait (sem=sem@entry=0x7f7a2e9f6d38, abstime=0x0) at sem_waitcommon.c:111
  #2  0x00007f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f6d38, abstime=0x0) at sem_waitcommon.c:181
  #3  0x00007f8272cf297a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29
  #4  0x000055f60d36c252 in PGSemaphoreLock ()
  #5  0x000055f60d3e35cc in LWLockAcquire ()
  #6  0x000055f60d3d14ee in ProcArrayAdd ()
  #7  0x000055f60d3e01a3 in InitProcessPhase2 ()
  #8  0x000055f60d51c99b in InitPostgres ()
  #9  0x000055f60d3f2baf in PostgresMain ()
  #10 0x000055f60d102bf8 in ?? ()
  #11 0x000055f60d37e492 in PostmasterMain ()
  #12 0x000055f60d103fa5 in main ()
  (gdb) cont
  Continuing.
  ^C
  Program received signal SIGINT, Interrupt.
  0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  205    in ../sysdeps/unix/sysv/linux/futex-internal.h
  (gdb) bt
  #0  0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  #1  do_futex_wait (sem=sem@entry=0x7f7a2e9f6d38, abstime=0x0) at sem_waitcommon.c:111
  #2  0x00007f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f6d38, abstime=0x0) at sem_waitcommon.c:181
  #3  0x00007f8272cf297a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29
  #4  0x000055f60d36c252 in PGSemaphoreLock ()
  #5  0x000055f60d3e35cc in LWLockAcquire ()
  #6  0x000055f60d3d14ee in ProcArrayAdd ()
  #7  0x000055f60d3e01a3 in InitProcessPhase2 ()
  #8  0x000055f60d51c99b in InitPostgres ()
  #9  0x000055f60d3f2baf in PostgresMain ()
  #10 0x000055f60d102bf8 in ?? ()
  #11 0x000055f60d37e492 in PostmasterMain ()
  #12 0x000055f60d103fa5 in main ()
  (gdb) cont
  Continuing.
  ^C
  Program received signal SIGINT, Interrupt.
  0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  205    in ../sysdeps/unix/sysv/linux/futex-internal.h
  (gdb) bt
  #0  0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  #1  do_futex_wait (sem=sem@entry=0x7f7a2e9f6d38, abstime=0x0) at sem_waitcommon.c:111
  #2  0x00007f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f6d38, abstime=0x0) at sem_waitcommon.c:181
  #3  0x00007f8272cf297a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29
  #4  0x000055f60d36c252 in PGSemaphoreLock ()
  #5  0x000055f60d3e35cc in LWLockAcquire ()
  #6  0x000055f60d3d14ee in ProcArrayAdd ()
  #7  0x000055f60d3e01a3 in InitProcessPhase2 ()
  #8  0x000055f60d51c99b in InitPostgres ()
  #9  0x000055f60d3f2baf in PostgresMain ()
  #10 0x000055f60d102bf8 in ?? ()
  #11 0x000055f60d37e492 in PostmasterMain ()
  #12 0x000055f60d103fa5 in main ()
  (gdb) cont
  Continuing.
  ^C
  Program received signal SIGINT, Interrupt.
  0x00007f8270a039f3 in delete_module () at ../sysdeps/unix/syscall-template.S:86
  86    ../sysdeps/unix/syscall-template.S: No such file or directory.
  (gdb) bt
  #0  0x00007f8270a039f3 in delete_module () at ../sysdeps/unix/syscall-template.S:86
  #1  0x000055f60d3d0101 in WaitEventSetWait ()
  #2  0x000055f60d2dccff in secure_read ()
  #3  0x000055f60d2e9138 in ?? ()
  #4  0x000055f60d2e9e75 in pq_getbyte ()
  #5  0x000055f60d3f3100 in PostgresMain ()
  #6  0x000055f60d102bf8 in ?? ()
  #7  0x000055f60d37e492 in PostmasterMain ()
  #8  0x000055f60d103fa5 in main ()
  (gdb) cont
  Continuing.
  ^C
  Program received signal SIGINT, Interrupt.
  0x00007f8270a039f3 in delete_module () at ../sysdeps/unix/syscall-template.S:86
  86    in ../sysdeps/unix/syscall-template.S
  (gdb) detach
  Detaching from program: /usr/lib/postgresql/11/bin/postgres, process 79572
  (gdb) quit

I hope this is useful. Let me know if there are any parameters I should
try adjusting, or some other information I should try to extract. Thanks!


  Best regards,

    Adam

--
 "Why do you ask so much?"                                  Adam Sjøgren
 "I'm a kid. It's what we do."                         asjo@koldfront.dk



Adam Sjøgren <asjo@koldfront.dk> writes:
> Tom writes:
>> * Adam Sjøgren (asjo@koldfront.dk) wrote:
>>> Sometimes new database logins slow down, from usually taking <0.05s to
>>> taking minutes. This is for psql as a normal user using Kerberos, for
>>> psql as the postgres superuser, for the web-application logging into the
>>> database, for everything.

>> Also, obtaining stack traces from a few of the stuck processes would
>> likely be quite informative.

> I will try to look at pg_stat_activity and pg_locks the next time - it
> just happened now and I focused on getting some stacktraces -
> unfortunately I do not have the -dbg package installed, so I don't know
> how useful they are.

Frequently, stack traces taken without debug symbols are pretty useless,
but you seem to have got lucky.  It seems clear that the unexpected wait
is here:

>   (gdb) bt
>   #0  0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f1d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
>   #1  do_futex_wait (sem=sem@entry=0x7f7a2e9f1d38, abstime=0x0) at sem_waitcommon.c:111
>   #2  0x00007f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f1d38, abstime=0x0) at sem_waitcommon.c:181
>   #3  0x00007f8272cf297a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29
>   #4  0x000055f60d36c252 in PGSemaphoreLock ()
>   #5  0x000055f60d3e35cc in LWLockAcquire ()
>   #6  0x000055f60d3d14ee in ProcArrayAdd ()
>   #7  0x000055f60d3e01a3 in InitProcessPhase2 ()
>   #8  0x000055f60d51c99b in InitPostgres ()
>   #9  0x000055f60d3f2baf in PostgresMain ()
>   #10 0x000055f60d102bf8 in ?? ()
>   #11 0x000055f60d37e492 in PostmasterMain ()
>   #12 0x000055f60d103fa5 in main ()

which implies that the problem is unexpectedly high contention for the
ProcArrayLock.  It's unsurprising that incoming processes would need
that lock, since they need to add themselves to the ProcArray.  What's
not so apparent is why there is so much activity as to block them
from doing so for a significant amount of time.  We can guess that
the "worker processes" you mentioned are ganging up on that lock
because they need to compute query snapshots, but that still doesn't
seem like enough of an explanation.

Can you show us exactly what the "worker processes" are doing while
this is going on?
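
Even a handful of repeated samples of something like this - assuming the
workers connect as the "workqueue" role seen in your ps output - would
give a rough picture:

  SELECT pid, state, wait_event_type, wait_event, xact_start,
         left(query, 60) AS query
    FROM pg_stat_activity
   WHERE usename = 'workqueue'
   ORDER BY xact_start;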

            regards, tom lane



Tom writes:

> Frequently, stack traces taken without debug symbols are pretty useless,
> but you seem to have got lucky.

For once, hah!

> It seems clear that the unexpected wait is here:
>
>>   (gdb) bt
>>   #0  0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f1d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
>>   #1  do_futex_wait (sem=sem@entry=0x7f7a2e9f1d38, abstime=0x0) at sem_waitcommon.c:111
>>   #2  0x00007f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f1d38, abstime=0x0) at sem_waitcommon.c:181
>>   #3  0x00007f8272cf297a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29
>>   #4  0x000055f60d36c252 in PGSemaphoreLock ()
>>   #5  0x000055f60d3e35cc in LWLockAcquire ()
>>   #6  0x000055f60d3d14ee in ProcArrayAdd ()
>>   #7  0x000055f60d3e01a3 in InitProcessPhase2 ()
>>   #8  0x000055f60d51c99b in InitPostgres ()
>>   #9  0x000055f60d3f2baf in PostgresMain ()
>>   #10 0x000055f60d102bf8 in ?? ()
>>   #11 0x000055f60d37e492 in PostmasterMain ()
>>   #12 0x000055f60d103fa5 in main ()
>
> which implies that the problem is unexpectedly high contention for the
> ProcArrayLock.  It's unsurprising that incoming processes would need
> that lock, since they need to add themselves to the ProcArray.  What's
> not so apparent is why there is so much activity as to block them
> from doing so for a significant amount of time.  We can guess that
> the "worker processes" you mentioned are ganging up on that lock
> because they need to compute query snapshots, but that still doesn't
> seem like enough of an explanation.
>
> Can you show us exactly what the "worker processes" are doing while
> this is going on?

One thing I should have mentioned, but forgot, is that the database is
configured to do logical replication to another machine - could that
have an effect on the lock contention?

A colleague pointed out that in the pg_locks output, the replication
processes are waiting on the ProcArray lock:

 · https://koldfront.dk/misc/postgresql/pg_stat_activity_pg_locks-BAD.txt.gz

I can try turning off replication the next time I see the problem.

The workers basically loop on:

 * SELECT a job from a job-table, UPDATE it
 * Take an advisory lock on the relevant row in the calculations table
 * Do some work, at least a couple of SELECTs from various tables
 * UPDATE or INSERT a row into the calculations table, release the lock
 * INSERT a row in an exitstatus table
 * DELETE the job
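
In SQL terms, one iteration is roughly the sketch below - the table and
column names, the hard-coded ids and the SKIP LOCKED detail are
illustrative shorthand, not necessarily what our code literally does:

  -- 1. claim one job; FOR UPDATE SKIP LOCKED stands in for however the
  --    real code keeps two workers from picking the same job
  UPDATE jobs
     SET claimed_by = 'worker-17', claimed_at = now()
   WHERE id = (SELECT id
                 FROM jobs
                WHERE claimed_by IS NULL
                ORDER BY id
                LIMIT 1
                  FOR UPDATE SKIP LOCKED)
  RETURNING id, calculation_id;

  -- 2. serialize work on the calculations row (42 = calculation_id from step 1)
  SELECT pg_advisory_lock(42);

  -- 3. do the calculation (various SELECTs), then store the result and
  --    release the advisory lock
  INSERT INTO calculations (id, result)
       VALUES (42, '...')
  ON CONFLICT (id) DO UPDATE SET result = EXCLUDED.result;
  SELECT pg_advisory_unlock(42);

  -- 4. record the outcome and remove the job
  INSERT INTO exitstatus (job_id, status) VALUES (17, 'ok');
  DELETE FROM jobs WHERE id = 17;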

The calculations can take anywhere from sub-second to several minutes,
with most of them at the shorter end. While this is happening there are
millions of jobs in the job-table.

When the problem occurs, the worker logfiles indicate that they are
moving along, still working.

To give a sense of scale, here is the number of transactions per second
plotted:

 · https://koldfront.dk/misc/postgresql/postgres_transactions_ALL-pinpoint_1597799383-1597907383.png

(I think the hole in the graph is when Munin timed out getting the
information, and the number of transactions dropped as I stopped most of
the workers.)


  Best regards,

    Adam

--
 "Examination and mastering of a new highly                 Adam Sjøgren
  intellectual equipment was a hard labour."           asjo@koldfront.dk



Adam Sjøgren <asjo@koldfront.dk> writes:
> Tom writes:
>> ... which implies that the problem is unexpectedly high contention for the
>> ProcArrayLock.

> One thing I should have mentioned, but forgot, is that the database is
> configured to do logical replication to another machine - could that
> have an effect on the lock contention?
> A colleague pointed out that in the pg_locks output, the replication
> processes are waiting on the ProcArray lock:
>  · https://koldfront.dk/misc/postgresql/pg_stat_activity_pg_locks-BAD.txt.gz

Yeah, that is *mighty* interesting.  For the archives' sake, I'll point
out that this shows 16 walsender processes, of which 6 are in
WalSenderWaitForWAL waits (ie, basically idle) and all of the other 10
are waiting for the ProcArrayLock.

There are also three backends in ProcArrayGroupUpdate state, confirming
that we have a lot of contention for the ProcArrayLock ... but
interestingly, no non-walsender processes appear to be blocked directly
on the ProcArrayLock.  In particular, I *don't* see a backend that could
plausibly be the leader of that group commit; the other sessions showing
non-SELECT activity are waiting for assorted heavyweight locks, and they
all look like they're still running their transactions, not trying to
commit.  Perhaps this just reflects the fact that pg_stat_activity isn't
an instantaneous snapshot, though?

It's also evident that there's some parallel query activity going on,
as some sessions are in BgWorkerShutdown waits; so parallel worker
entry/exit might be contributing to the overall load on ProcArrayLock.

Anyway, we still don't know more than that the problem is ProcArrayLock
contention; it seems that the walsenders are involved, but we don't
really know if they're causing the problem or are just more victims.

One thing I'm curious about is whether you have walsenders starting
or stopping on a regular basis?  The ones visible here seem to all
have been running for at least an hour or so, but in a quick look
through the code it seemed like much of the walsender code that
would have any reason to take the ProcArrayLock is in walsender
startup or shutdown.
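
One way to check is to sample their start times now and then, along
these lines (backend_type should be available on 11; the rest is just a
sketch):

  SELECT pid, backend_start, state, wait_event_type, wait_event
    FROM pg_stat_activity
   WHERE backend_type = 'walsender'
   ORDER BY backend_start;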

Another angle that is worth a try is to see whether you can reduce
the number of walsenders, and then see if that ameliorates the issue.
I'm not totally sure but I think we support cascaded replication,
so that you could reduce the replication load on the primary just
by reconfiguring some of the standbys to pull from other standbys
rather than directly from the primary.

            regards, tom lane



Tom writes:

> Adam Sjøgren <asjo@koldfront.dk> writes:
>> Tom writes:
>>> ... which implies that the problem is unexpectedly high contention for the
>>> ProcArrayLock.
>
>> One thing I should have mentioned, but forgot, is that the database is
>> configured to do logical replication to another machine - could that
>> have an effect on the lock contention?
>> A colleague pointed out that in the pg_locks output, the replication
>> processes are waiting on the ProcArray lock:
>>  · https://koldfront.dk/misc/postgresql/pg_stat_activity_pg_locks-BAD.txt.gz
>
> Yeah, that is *mighty* interesting.  For the archives' sake, I'll point
> out that this shows 16 walsender processes, of which 6 are in
> WalSenderWaitForWAL waits (ie, basically idle) and all of the other 10
> are waiting for the ProcArrayLock.

Sorry for not returning on this before now.

I have seen the slow login problem - ProcArrayLock contention - happen
with replication turned off, so I think that can be ruled out as the
cause.

I just observed something interesting right now, however.

Just to recap the scene: the basic setup is that we have a table with a
queue of jobs to be processed, and we have a number of programs (say 10)
on 6 other machines grabbing those jobs, doing calculations and
reporting back.

The number of jobs at any given time can fluctuate from none to
millions. Typically millions of jobs take some days to a week to get
through.

Now, I happened to have a psql prompt open when the problem of new
logins being slow appeared - and I ran a query that counted the number
of jobs by job-type.

Before the slow login problem, that query took around 70-80 ms, but when
the slow login appeared, it took 20+ seconds to run the query.

EXPLAIN ANALYZE showed that it was running a parallel query.

So I tried turning parallel queries off - with SET
max_parallel_workers_per_gather = 0; - and now the query took around 150
ms consistently, despite logins still being (variably) slow.

So that confirms the ProcArrayLock contention, I think.

My next thought is: I have 6*10 programs making connections and doing
parallel queries over the job queue, to get new jobs. If the jobs happen
to be very "easy", those programs will be doing a lot of queries for
jobs. And when the queue is large enough (I'm guessing), those queries
will be parallel, so they will be creating new processes that need to be
added and removed from the ProcArray.

So am I shooting myself in the foot here, by creating lock contention
when trying to get jobs quickly from a largeish table?

I think my next step will be to modify the programs that grab the jobs
to avoid parallel queries when looking up the next job to run.
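
If changing the programs turns out to be awkward, I suppose I could
instead disable parallelism for the role the workers connect as -
something like this (using the "workqueue" role from the ps output as
an example):

  -- per role, takes effect for new connections made by that role:
  ALTER ROLE workqueue SET max_parallel_workers_per_gather = 0;

  -- or per session, right after each worker connects:
  SET max_parallel_workers_per_gather = 0;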

You anticipated this, it just took me until now to get a little closer:

> It's also evident that there's some parallel query activity going on,
> as some sessions are in BgWorkerShutdown waits; so parallel worker
> entry/exit might be contributing to the overall load on ProcArrayLock.


  Best regards,

    Adam

--
 "That's one of the remarkable things about life.           Adam Sjøgren
  It's never so bad that it can't get worse."          asjo@koldfront.dk



On 9/30/20 1:22 PM, Adam Sjøgren wrote:
> Tom writes:
> 
>> Adam Sjøgren <asjo@koldfront.dk> writes:
>>> Tom writes:
>>>> ... which implies that the problem is unexpectedly high contention for the
>>>> ProcArrayLock.
>>
>>> One thing I should have mentioned, but forgot, is that the database is
>>> configured to do logical replication to another machine - could that
>>> have an effect on the lock contention?
>>> A colleague pointed out that in the pg_locks output, the replication
>>> processes are waiting on the ProcArray lock:
>>>   · https://koldfront.dk/misc/postgresql/pg_stat_activity_pg_locks-BAD.txt.gz
>>
>> Yeah, that is *mighty* interesting.  For the archives' sake, I'll point
>> out that this shows 16 walsender processes, of which 6 are in
>> WalSenderWaitForWAL waits (ie, basically idle) and all of the other 10
>> are waiting for the ProcArrayLock.
> 
> Sorry for not returning on this before now.
> 
> I have seen the slow login problem - ProcArrayLock contention - happen
> with replication turned off, so I think that can be ruled out as the
> cause.
> 
> I just observed something interesting right now, however.
> 
> Just to recap the scene: the basic setup is that we have a table with a
> queue of jobs to be processed, and we have a number of programs (say 10)
> on 6 other machines grabbing those jobs, doing calculations and
> reporting back.

I don't have an answer. Not even sure if this is relevant to the 
problem, but how are the jobs getting into the queue?

> 
> The number of jobs at any given time can fluctuate from none to
> millions. Typically millions of jobs take some days to a week to get
> through.
> 
> Now, I happened to have a psql prompt open when the problem of new
> logins being slow appeared - and I ran a query that counted the number
> of jobs by job-type.
> 
> Before the slow login problem, that query took around 70-80 ms, but when
> the slow login appeared, it took 20+ seconds to run the query.
> 
> EXPLAIN ANALYZE showed that it was running a parallel query.
> 
> So I tried turning parallel queries off - with SET
> max_parallel_workers_per_gather = 0; - and now the query took around 150
> ms consistently, despite logins still being (variably) slow.
> 
> So that confirms the ProcArrayLock contention, I think.
> 
> My next thought is: I have 6*10 programs making connections and doing
> parallel queries over the job queue, to get new jobs. If the jobs happen
> to be very "easy", those programs will be doing a lot of queries for
> jobs. And when the queue is large enough (I'm guessing), those queries
> will be parallel, so they will be creating new processes that need to be
> added and removed from the ProcArray.
> 
> So am I shooting myself in the foot here, by creating lock contention
> when trying to get jobs quickly from a largeish table?
> 
> I think my next step will be to modify the programs that grab the jobs
> to avoid parallel queries when looking up the next job to run.
> 
> You anticipated this, it just took me until now to get a little closer:
> 
>> It's also evident that there's some parallel query activity going on,
>> as some sessions are in BgWorkerShutdown waits; so parallel worker
>> entry/exit might be contributing to the overall load on ProcArrayLock.
> 
> 
>    Best regards,
> 
>      Adam
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



Adrian writes:

> I don't have an answer. Not even sure if this is relevant to the
> problem, but how are the jobs getting into the queue?

Plain INSERTs - often a lot at the same time. I inserted 400K jobs to
clean up after a bug earlier today, for instance.

What were you suspecting?


  Best regards,

    Adam

--
 "Don't think twice                                         Adam Sjøgren
  It's all right"                                      asjo@koldfront.dk



On 9/30/20 2:30 PM, Adam Sjøgren wrote:
> Adrian writes:
> 
>> I don't have an answer. Not even sure if this is relevant to the
>> problem, but how are the jobs getting into the queue?
> 
> Plain INSERTs - often a lot at the same time. I inserted 400K jobs to
> clean up after a bug earlier today, for instance.
> 
> What were you suspecting?

Honestly, a fishing expedition. All the discussion, as far as I can 
remember, had to do with the retrieval part of the process. Just thought 
for completeness it might be useful to flesh out the insertion portion 
of the process. Though it does get one wondering how the appearance 
of a large number of jobs at once affects those programs looking for 
new jobs?

> 
> 
>    Best regards,
> 
>      Adam
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



On Wed, Sep 30, 2020 at 3:41 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
> On 9/30/20 2:30 PM, Adam Sjøgren wrote:
>> Adrian writes:
>>
>>> I don't have an answer. Not even sure if this is relevant to the
>>> problem, but how are the jobs getting into the queue?
>>
>> Plain INSERTs - often a lot at the same time. I inserted 400K jobs to
>> clean up after a bug earlier today, for instance.
>>
>> What were you suspecting?
>
> Honestly, a fishing expedition. All the discussion, as far as I can
> remember, had to do with the retrieval part of the process. Just thought
> for completeness it might be useful to flesh out the insertion portion
> of the process. Though it does get one wondering how the appearance
> of a large number of jobs at once affects those programs looking for
> new jobs?

Is autovacuum/analyze keeping up with the changes, to allow the planner to
make prudent choices? With so many updates/deletes I would hope so, but if
you are still using the default 20% or 10% scale_factor values and the table
at times has many millions of rows, perhaps not.
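
If autovacuum on the queue table turns out to be lagging, its thresholds
can also be tightened for just that table rather than globally - for
example (table name and numbers purely illustrative):

  ALTER TABLE jobs SET (autovacuum_vacuum_scale_factor  = 0.01,
                        autovacuum_analyze_scale_factor = 0.01);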