Thread: Database logins taking longer and longer, showing up as "authentication" in ps(1)
Database logins taking longer and longer, showing up as "authentication" in ps(1)
From: Adam Sjøgren
Hi,

We have a PostgreSQL 11.3¹ running on an Ubuntu 16.04 server, which
sometimes exhibits a behaviour I can't quite understand: simply logging
into the database starts to take minutes to complete.

We have 60 processes (workers) running on different machines accessing
the database, that all grab jobs from a queue and update rows in a table
after doing some calculations (which vary in time from <1s to perhaps a
minute, many of them fast).

Sometimes new database logins slow down, from usually taking <0.05s to
taking minutes. This is for psql as a normal user using Kerberos, for
psql as the postgres superuser, for the web-application logging into the
database, for everything.

What I see in the output of ps(1) is a lot of postgres processes in
state "authentication" when this problem occurs. It "feels" like the
logins are waiting for something - when e.g. psql finally logs in,
running queries seem normal.

The only "special" thing I can think of is that the workers use advisory
locks on rows in the table they are updating (as there can be multiple
jobs in flight at the same time, for the same row). I can imagine that a
pattern of jobs and workers could make the locking cause those workers
to wait and slow each other down, but I don't understand how it can
affect the time it takes for new database logins.

When this happens, if I stop the 60 workers, the login times are back to
normal within seconds, and all the postgres processes showing
"authentication" switch away from that state. If I start the workers
again, the login problem appears within a couple of minutes. If I start
a smaller number of workers, say 5, the problem doesn't appear.

Could this be lock contention on a database table somehow affecting
logins?

I have tried searching for this problem, and found some older references
to things² that seem to have been fixed in the meantime. Is anybody else
seeing this?

The server is configured with max_connections 1200 and currently has
~400 connections. It has 1.5 TB of memory and 112 cores (including
hyperthreading) - the load is not high when I see this problem, between
20 and 30.

Any ideas and suggestions are welcome - if there is some relevant
information I can provide, I will be happy to try.

Best regards,

  Adam

¹ We should plan to upgrade to 11.9, yes.

² E.g. https://www.postgresql.org/message-id/CAJw8uJQGoobesbPCMbxj6Vb4nv9D-GgvZ%2B7pK%2Bfckbb4DqJEAg%40mail.gmail.com
  and https://www.postgresql.org/message-id/20140312232810.634.7997%40wrigleys.postgresql.org

-- 
 "We don't care about crashing our cars                   Adam Sjøgren
  Pedal to the metal and our heads among the stars"   asjo@koldfront.dk
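For context when reading the rest of the thread: what a backend is
currently waiting on can be checked from inside the database, provided
it has already registered itself in pg_stat_activity (which, as it turns
out later, the stuck "authentication" backends have not yet done). A
minimal sketch, not part of the original report:

  -- Show what each backend is waiting on (PostgreSQL 9.6 and later).
  -- A lot of LWLock waits here would point at the contention that is
  -- discussed further down in this thread.
  SELECT pid, state, wait_event_type, wait_event,
         now() - backend_start AS connected_for,
         query
    FROM pg_stat_activity
   ORDER BY wait_event_type NULLS LAST, wait_event;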
Re: Database logins taking longer and longer, showing up as "authentication" in ps(1)
From: Stephen Frost
Greetings,

* Adam Sjøgren (asjo@koldfront.dk) wrote:
> We have a PostgreSQL 11.3¹ running on an Ubuntu 16.04 server, which
> sometimes exhibits a behaviour I can't quite understand: simply logging
> into the database starts to take minutes to complete.
>
> We have 60 processes (workers) running on different machines accessing
> the database, that all grab jobs from a queue and update rows in a table
> after doing some calculations (which vary in time from <1s to perhaps a
> minute, many of them fast).
>
> Sometimes new database logins slow down, from usually taking <0.05s to
> taking minutes. This is for psql as a normal user using Kerberos, for
> psql as the postgres superuser, for the web-application logging into the
> database, for everything.

When in doubt, blame DNS.

Alternatively, in your case, the issue might be the KDC taking forever
to issue a ticket for the service.

Note that, in both cases, this is a client-side issue. Once the ticket
has been acquired for the PG service, the actual authentication on the
server side should be very fast and not depend on things external
(though you might check if you have log_hostname on..).

Thanks,

Stephen
Re: Database logins taking longer and longer, showing up as "authentication" in ps(1)
From: Adam Sjøgren
Laurenz writes:

> On Tue, 2020-08-18 at 14:53 +0200, Adam Sjøgren wrote:
>> We have 60 processes (workers) running on different machines accessing
>> the database, that all grab jobs from a queue and update rows in a table
>> after doing some calculations (which vary in time from <1s to perhaps a
>> minute, many of them fast).
>>
>> Sometimes new database logins slow down, from usually taking <0.05s to
>> taking minutes. This is for psql as a normal user using Kerberos, for
>> psql as the postgres superuser, for the web-application logging into the
>> database, for everything.
>
> I would suspect that the problem is with the authentication method.

Interesting, how are you suspecting the interaction to be playing out?

> How is authentication configured?

Primarily using gss (for the web-application, delegated credentials),
md5 (system users), and a few trust (for legacy access).

> I'd "strace" a connection that is hanging in "authenticating" and
> see what the process does.

I did that one of the previous times I have seen this problem, and saw
stuff like this:

  2020-06-10 11:22:14 strace: Process 101559 attached
  2020-06-10 11:22:14 futex(0x7f46df29e138, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff) = 0
  ...
  2020-06-10 11:24:09 futex(0x7f46df29e138, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff) = 0
  2020-06-10 11:24:09 futex(0x7f46df2a2e38, FUTEX_WAKE, 1) = 1
  2020-06-10 11:24:09 brk(0x55c197084000) = 0x55c197084000
  2020-06-10 11:24:09 brk(0x55c1970b0000) = 0x55c1970b0000
  2020-06-10 11:24:09 open("global/pg_filenode.map", O_RDONLY) = 6

(where the elided part is futex() calls).

The slow login happens regardless of whether I log in with Kerberos or
as the postgres superuser with psql.

Best regards,

  Adam

-- 
 "Sometimes I feel like going home, walking in the        Adam Sjøgren
  boardroom                                           asjo@koldfront.dk
  Sometimes I feel like I wanna go, walk into the ballroom"
Re: Database logins taking longer and longer, showing up as "authentication" in ps(1)
From: Tom Lane
Stephen Frost <sfrost@snowman.net> writes:
> * Adam Sjøgren (asjo@koldfront.dk) wrote:
>> Sometimes new database logins slow down, from usually taking <0.05s to
>> taking minutes. This is for psql as a normal user using Kerberos, for
>> psql as the postgres superuser, for the web-application logging into the
>> database, for everything.

> When in doubt, blame DNS.

The trouble with outside-the-database explanations is that it's not
clear why stopping the worker processes would make such issues go away.

I think looking into pg_stat_activity and pg_locks might be interesting,
although it's possible the delay is before the new session has made any
entries there.

Also, obtaining stack traces from a few of the stuck processes would
likely be quite informative.

			regards, tom lane
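A sketch of the kind of check Tom suggests here, combining pg_locks with
pg_stat_activity; note that this only shows heavyweight locks, not
LWLocks such as the ProcArrayLock that turns out to matter later in the
thread:

  -- Ungranted heavyweight lock requests and the sessions blocking them.
  SELECT w.pid                   AS waiting_pid,
         w.locktype,
         w.mode,
         pg_blocking_pids(w.pid) AS blocked_by,
         a.state,
         a.query
    FROM pg_locks w
    JOIN pg_stat_activity a ON a.pid = w.pid
   WHERE NOT w.granted;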
Re: Database logins taking longer and longer, showing up as "authentication" in ps(1)
From: Adam Sjøgren
Stephen writes:

> * Adam Sjøgren (asjo@koldfront.dk) wrote:
>> We have 60 processes (workers) running on different machines accessing
>> the database, that all grab jobs from a queue and update rows in a table
>> after doing some calculations (which vary in time from <1s to perhaps a
>> minute, many of them fast).
>>
>> Sometimes new database logins slow down, from usually taking <0.05s to
>> taking minutes. This is for psql as a normal user using Kerberos, for
>> psql as the postgres superuser, for the web-application logging into the
>> database, for everything.
>
> When in doubt, blame DNS.

I'd love to! However I don't see any DNS outages on our local network
correlating with whether I run 60 workers or 5 workers.

> Alternatively, in your case, the issue might be the KDC taking forever
> to issue a ticket for the service.

If that was the cause, logging in as the 'postgres' superuser (not using
Kerberos) locally on the server should be fast regardless, right?

> (though you might check if you have log_hostname on..).

It's off:

  $ grep hostname /etc/postgresql/11/main/postgresql.conf
  #log_hostname = off

Thanks for replying!

Best regards,

  Adam

-- 
 "Probabilistic algorithms don't appeal to me. (This      Adam Sjøgren
  is a question of aesthetics, not practicality.) So  asjo@koldfront.dk
  later, I figured out how to remove the probability
  and turn it into a deterministic algorithm."
Re: Database logins taking longer and longer, showing up as "authentication" in ps(1)
From: Srinivasa T N
On Tue, Aug 18, 2020, 8:04 PM Stephen Frost <sfrost@snowman.net> wrote:
> Greetings,
>
> * Adam Sjøgren (asjo@koldfront.dk) wrote:
>> We have a PostgreSQL 11.3¹ running on an Ubuntu 16.04 server, which
>> sometimes exhibits a behaviour I can't quite understand: simply logging
>> into the database starts to take minutes to complete.
>>
>> We have 60 processes (workers) running on different machines accessing
>> the database, that all grab jobs from a queue and update rows in a table
>> after doing some calculations (which vary in time from <1s to perhaps a
>> minute, many of them fast).
>>
>> Sometimes new database logins slow down, from usually taking <0.05s to
>> taking minutes. This is for psql as a normal user using Kerberos, for
>> psql as the postgres superuser, for the web-application logging into the
>> database, for everything.
>
> When in doubt, blame DNS.
>
> Alternatively, in your case, the issue might be the KDC taking forever
> to issue a ticket for the service.
>
> Note that, in both cases, this is a client-side issue.
If it is a client-side issue, how do you explain that the OP is not
facing the problem when there are only a few clients?
Regards,
Seenu.
> Once the ticket
> has been acquired for the PG service, the actual authentication on the
> server side should be very fast and not depend on things external
> (though you might check if you have log_hostname on..).
>
> Thanks,
>
> Stephen
Re: Database logins taking longer and longer, showing up as "authentication" in ps(1)
From: Adam Sjøgren
Tom writes: > Stephen Frost <sfrost@snowman.net> writes: >> * Adam Sjøgren (asjo@koldfront.dk) wrote: >>> Sometimes new database logins slow down, from usually taking <0.05s to >>> taking minutes. This is for psql as a normal user using Kerberos, for >>> psql as the postgres superuser, for the web-application logging into the >>> database, for everything. > > I think looking into pg_stat_activity and pg_locks might be interesting, > although it's possible the delay is before the new session has made > any entries there. > > Also, obtaining stack traces from a few of the stuck processes would > likely be quite informative. I will try to look at pg_stat_activity and pg_locks the next time - it just happened now and I focused on getting some stacktraces - unfortunately I do not have the -dbg package installed, so I don't know how useful they are. Here's the first one: $ ps faux | grep postgres | grep authentication postgres 16697 0.0 0.0 34686088 8352 ? Ss 14:45 0:00 \_ postgres: 11/main: workqueue XXdb 127.0.0.1(42534)authentication postgres 16701 0.0 0.0 34686088 6476 ? Ss 14:45 0:00 \_ postgres: 11/main: anonymous XXdb 127.0.0.1(42536)authentication postgres 16702 0.0 0.0 34686088 8348 ? Ss 14:45 0:00 \_ postgres: 11/main: svc_user XXdb 127.0.0.1(42540)authentication postgres 16704 0.0 0.0 34686040 5376 ? Ss 14:45 0:00 \_ postgres: 11/main: postgres postgres [local] authentication postgres 16705 0.0 0.0 34686088 8352 ? Ss 14:45 0:00 \_ postgres: 11/main: svc_user XXdb 127.0.0.1(42542)authentication postgres 16708 0.0 0.0 34686088 6476 ? Ss 14:45 0:00 \_ postgres: 11/main: svc_user XXdb 127.0.0.1(42544)authentication $ sudo gdb -p 16708 GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1 [...] Attaching to process 16708 Reading symbols from /usr/lib/postgresql/11/bin/postgres...(no debugging symbols found)...done. [...] 0x00007f8270a039f3 in delete_module () at ../sysdeps/unix/syscall-template.S:86 86 ../sysdeps/unix/syscall-template.S: No such file or directory. (gdb) bt #0 0x00007f8270a039f3 in delete_module () at ../sysdeps/unix/syscall-template.S:86 #1 0x000055f60d3d0101 in WaitEventSetWait () #2 0x000055f60d2dccff in secure_read () #3 0x000055f60d2e9138 in ?? () #4 0x000055f60d2e9e75 in pq_getbyte () #5 0x000055f60d3f3100 in PostgresMain () #6 0x000055f60d102bf8 in ?? () #7 0x000055f60d37e492 in PostmasterMain () #8 0x000055f60d103fa5 in main () (gdb) I think I was too slow on this one, and this is what it looks like if the process is just waiting for a connection (as you can guess, I'm not used to looking at backtraces). Here is another one, which hopefully is caught at a better moment, and more useful: postgres 19495 0.0 0.0 34686088 8412 ? Ss 14:48 0:00 \_ postgres: 11/main: workqueue XXdb 127.0.0.1(43390)authentication postgres 19497 0.0 0.0 34686088 6476 ? Ss 14:48 0:00 \_ postgres: 11/main: XXX XXdb 127.0.0.1(43392) authentication postgres 19502 0.0 0.0 34686088 8412 ? Ss 14:48 0:00 \_ postgres: 11/main: XXX XXdb 127.0.0.1(43394) authentication $ sudo gdb -p 19495 GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1 [...] Attaching to process 19495 [...] 0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f1d38)at ../sysdeps/unix/sysv/linux/futex-internal.h:205 205 ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory. 
(gdb) bt #0 0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f1d38)at ../sysdeps/unix/sysv/linux/futex-internal.h:205 #1 do_futex_wait (sem=sem@entry=0x7f7a2e9f1d38, abstime=0x0) at sem_waitcommon.c:111 #2 0x00007f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f1d38, abstime=0x0) at sem_waitcommon.c:181 #3 0x00007f8272cf297a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29 #4 0x000055f60d36c252 in PGSemaphoreLock () #5 0x000055f60d3e35cc in LWLockAcquire () #6 0x000055f60d3d14ee in ProcArrayAdd () #7 0x000055f60d3e01a3 in InitProcessPhase2 () #8 0x000055f60d51c99b in InitPostgres () #9 0x000055f60d3f2baf in PostgresMain () #10 0x000055f60d102bf8 in ?? () #11 0x000055f60d37e492 in PostmasterMain () #12 0x000055f60d103fa5 in main () (gdb) cont Continuing. ^C Program received signal SIGINT, Interrupt. 0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f1d38)at ../sysdeps/unix/sysv/linux/futex-internal.h:205 205 in ../sysdeps/unix/sysv/linux/futex-internal.h (gdb) bt #0 0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f1d38)at ../sysdeps/unix/sysv/linux/futex-internal.h:205 #1 do_futex_wait (sem=sem@entry=0x7f7a2e9f1d38, abstime=0x0) at sem_waitcommon.c:111 #2 0x00007f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f1d38, abstime=0x0) at sem_waitcommon.c:181 #3 0x00007f8272cf297a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29 #4 0x000055f60d36c252 in PGSemaphoreLock () #5 0x000055f60d3e35cc in LWLockAcquire () #6 0x000055f60d3d14ee in ProcArrayAdd () #7 0x000055f60d3e01a3 in InitProcessPhase2 () #8 0x000055f60d51c99b in InitPostgres () #9 0x000055f60d3f2baf in PostgresMain () #10 0x000055f60d102bf8 in ?? () #11 0x000055f60d37e492 in PostmasterMain () #12 0x000055f60d103fa5 in main () (gdb) cont Continuing. (gdb) bt Program received signal SIGUSR1, User defined signal 1. 0x00007f8270a039f3 in delete_module () at ../sysdeps/unix/syscall-template.S:86 86 ../sysdeps/unix/syscall-template.S: No such file or directory. (gdb) bt #0 0x00007f8270a039f3 in delete_module () at ../sysdeps/unix/syscall-template.S:86 #1 0x000055f60d3d0101 in WaitEventSetWait () #2 0x000055f60d3d0574 in WaitLatchOrSocket () #3 0x000055f60d3e0b41 in ProcSleep () #4 0x000055f60d3db249 in ?? () #5 0x000055f60d3dc591 in LockAcquireExtended () #6 0x000055f60d3da365 in LockTuple () #7 0x000055f60d144ea6 in heap_update () #8 0x000055f60d2c59ff in ?? () #9 0x000055f60d2c609b in ?? () #10 0x000055f60d2a0d8b in standard_ExecutorRun () #11 0x00007f8269670e45 in ?? () from /usr/lib/postgresql/11/lib/pg_stat_statements.so ---Type <return> to continue, or q <return> to quit--- #12 0x000055f60d3f5cd5 in ?? () #13 0x000055f60d3f5f28 in ?? () #14 0x000055f60d3f6be5 in PortalRun () #15 0x000055f60d3f26a1 in ?? () #16 0x000055f60d3f3bcb in PostgresMain () #17 0x000055f60d102bf8 in ?? () #18 0x000055f60d37e492 in PostmasterMain () #19 0x000055f60d103fa5 in main () (gdb) cont Continuing. (gdb) bt (gdb) cont (gdb) cont ^C Program received signal SIGINT, Interrupt. 0x00007f8270a039f3 in delete_module () at ../sysdeps/unix/syscall-template.S:86 86 in ../sysdeps/unix/syscall-template.S (gdb) bt #0 0x00007f8270a039f3 in delete_module () at ../sysdeps/unix/syscall-template.S:86 #1 0x000055f60d3d0101 in WaitEventSetWait () #2 0x000055f60d2dccff in secure_read () #3 0x000055f60d2e9138 in ?? 
() #4 0x000055f60d2e9e75 in pq_getbyte () #5 0x000055f60d3f3100 in PostgresMain () #6 0x000055f60d102bf8 in ?? () #7 0x000055f60d37e492 in PostmasterMain () #8 0x000055f60d103fa5 in main () (gdb) cont Continuing. ^C Program received signal SIGINT, Interrupt. 0x00007f8270a039f3 in delete_module () at ../sysdeps/unix/syscall-template.S:86 86 in ../sysdeps/unix/syscall-template.S (gdb) bt #0 0x00007f8270a039f3 in delete_module () at ../sysdeps/unix/syscall-template.S:86 #1 0x000055f60d3d0101 in WaitEventSetWait () #2 0x000055f60d2dccff in secure_read () #3 0x000055f60d2e9138 in ?? () #4 0x000055f60d2e9e75 in pq_getbyte () #5 0x000055f60d3f3100 in PostgresMain () #6 0x000055f60d102bf8 in ?? () #7 0x000055f60d37e492 in PostmasterMain () #8 0x000055f60d103fa5 in main () (gdb) quit A debugging session is active. Inferior 1 [process 19495] will be detached. Quit anyway? (y or n) n Not confirmed. (gdb) detach Detaching from program: /usr/lib/postgresql/11/bin/postgres, process 19495 (gdb) quit Is this helpful, or should I do something different? Here is a final one: postgres 21257 0.0 0.0 34686088 6476 ? Ss 14:52 0:00 \_ postgres: 11/main: svc_user XXdb 127.0.0.1(44192)authentication $ sudo gdb -p 21257 [...] 0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2ea001b8)at ../sysdeps/unix/sysv/linux/futex-internal.h:205 205 ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory. (gdb) bt #0 0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2ea001b8)at ../sysdeps/unix/sysv/linux/futex-internal.h:205 #1 do_futex_wait (sem=sem@entry=0x7f7a2ea001b8, abstime=0x0) at sem_waitcommon.c:111 #2 0x00007f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2ea001b8, abstime=0x0) at sem_waitcommon.c:181 #3 0x00007f8272cf297a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29 #4 0x000055f60d36c252 in PGSemaphoreLock () #5 0x000055f60d3e35cc in LWLockAcquire () #6 0x000055f60d3d14ee in ProcArrayAdd () #7 0x000055f60d3e01a3 in InitProcessPhase2 () #8 0x000055f60d51c99b in InitPostgres () #9 0x000055f60d3f2baf in PostgresMain () #10 0x000055f60d102bf8 in ?? () #11 0x000055f60d37e492 in PostmasterMain () #12 0x000055f60d103fa5 in main () (gdb) cont Continuing. bt Program received signal SIGUSR1, User defined signal 1. 0x000055f60d10e50e in slot_getsomeattrs () (gdb) bt #0 0x000055f60d10e50e in slot_getsomeattrs () #1 0x000055f60d29d260 in ?? () #2 0x000055f60d2a9965 in ExecScan () #3 0x000055f60d2b7b2a in ?? () #4 0x000055f60d2c71e6 in ?? () #5 0x000055f60d2a0d8b in standard_ExecutorRun () #6 0x00007f8269670e45 in ?? () from /usr/lib/postgresql/11/lib/pg_stat_statements.so #7 0x000055f60d3f571b in ?? () #8 0x000055f60d3f6d40 in PortalRun () #9 0x000055f60d3f26a1 in ?? () #10 0x000055f60d3f3bcb in PostgresMain () #11 0x000055f60d102bf8 in ?? () #12 0x000055f60d37e492 in PostmasterMain () #13 0x000055f60d103fa5 in main () (gdb) cont Continuing. Program received signal SIGUSR1, User defined signal 1. 0x000055f60d10e50e in slot_getsomeattrs () (gdb) bt #0 0x000055f60d10e50e in slot_getsomeattrs () #1 0x000055f60d29d260 in ?? () #2 0x000055f60d2a9965 in ExecScan () #3 0x000055f60d2b7b2a in ?? () #4 0x000055f60d2c71e6 in ?? () #5 0x000055f60d2a0d8b in standard_ExecutorRun () #6 0x00007f8269670e45 in ?? () from /usr/lib/postgresql/11/lib/pg_stat_statements.so #7 0x000055f60d3f571b in ?? () #8 0x000055f60d3f6d40 in PortalRun () #9 0x000055f60d3f26a1 in ?? 
() #10 0x000055f60d3f3bcb in PostgresMain () #11 0x000055f60d102bf8 in ?? () #12 0x000055f60d37e492 in PostmasterMain () #13 0x000055f60d103fa5 in main () (gdb) cont Continuing. (gdb) bt [Inferior 1 (process 21257) exited normally] (gdb) bt No stack. (gdb) cont The program is not being run. (gdb) quit Let me know if there is something else I should try. Best regards, Adam -- "It is a sort of cheap and cheerful kind of Adam Sjøgren abstraction, but it works well in practise." asjo@koldfront.dk
Re: Database logins taking longer and longer, showing up as "authentication" in ps(1)
From: Adam Sjøgren
Adam writes: > I will try to look at pg_stat_activity and pg_locks the next time I did not have to wait long, so I have captured the output of pg_stat_activity and pg_locks when everything is fine, by just doing "time sudo -u postgres PAGER=cat psql template1 -c 'SELECT * FROM pg_stat_activity'; time sudo -u postgres PAGER=cat psql template1 -c 'SELECT * FROM pg_locks'": · https://koldfront.dk/misc/postgresql/pg_stat_activity_pg_locks-OK.txt.gz And while it was bad, and I shut down the workers (notice how the first psql took 23s compared to 0.112s when everything was fine): · https://koldfront.dk/misc/postgresql/pg_stat_activity_pg_locks-BAD.txt.gz (I have "sanitized" both files for usernames, application names, IP-addresses and queries.) I also attached gdb to one of the processes and got a backtrace, gdb continued, pressed ^C after ~5s, got a backtrace and kept doing that for a while, this is the result: postgres 79572 0.0 0.0 34686088 8412 ? Ss 15:15 0:00 \_ postgres: 11/main: workqueue XXdb 127.0.0.1(53126)authentication postgres 79573 0.0 0.0 34686088 6476 ? Ss 15:15 0:00 \_ postgres: 11/main: anonymous XXdb 127.0.0.1(53128)authentication $ sudo gdb -p 79572 GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1 [...] Attaching to process 79572 [...] Reading symbols from /usr/lib/postgresql/11/bin/postgres...(no debugging symbols found)...done. [...] 0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38)at ../sysdeps/unix/sysv/linux/futex-internal.h:205 205 ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory. (gdb) bt #0 0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38)at ../sysdeps/unix/sysv/linux/futex-internal.h:205 #1 do_futex_wait (sem=sem@entry=0x7f7a2e9f6d38, abstime=0x0) at sem_waitcommon.c:111 #2 0x00007f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f6d38, abstime=0x0) at sem_waitcommon.c:181 #3 0x00007f8272cf297a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29 #4 0x000055f60d36c252 in PGSemaphoreLock () #5 0x000055f60d3e35cc in LWLockAcquire () #6 0x000055f60d3d14ee in ProcArrayAdd () #7 0x000055f60d3e01a3 in InitProcessPhase2 () #8 0x000055f60d51c99b in InitPostgres () #9 0x000055f60d3f2baf in PostgresMain () #10 0x000055f60d102bf8 in ?? () #11 0x000055f60d37e492 in PostmasterMain () #12 0x000055f60d103fa5 in main () (gdb) cont Continuing. ^C Program received signal SIGINT, Interrupt. 0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38)at ../sysdeps/unix/sysv/linux/futex-internal.h:205 205 in ../sysdeps/unix/sysv/linux/futex-internal.h (gdb) bt #0 0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38)at ../sysdeps/unix/sysv/linux/futex-internal.h:205 #1 do_futex_wait (sem=sem@entry=0x7f7a2e9f6d38, abstime=0x0) at sem_waitcommon.c:111 #2 0x00007f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f6d38, abstime=0x0) at sem_waitcommon.c:181 #3 0x00007f8272cf297a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29 #4 0x000055f60d36c252 in PGSemaphoreLock () #5 0x000055f60d3e35cc in LWLockAcquire () #6 0x000055f60d3d14ee in ProcArrayAdd () #7 0x000055f60d3e01a3 in InitProcessPhase2 () #8 0x000055f60d51c99b in InitPostgres () #9 0x000055f60d3f2baf in PostgresMain () #10 0x000055f60d102bf8 in ?? () #11 0x000055f60d37e492 in PostmasterMain () #12 0x000055f60d103fa5 in main () (gdb) cont Continuing. 
^C Program received signal SIGINT, Interrupt. 0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38)at ../sysdeps/unix/sysv/linux/futex-internal.h:205 205 in ../sysdeps/unix/sysv/linux/futex-internal.h (gdb) bt #0 0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38)at ../sysdeps/unix/sysv/linux/futex-internal.h:205 #1 do_futex_wait (sem=sem@entry=0x7f7a2e9f6d38, abstime=0x0) at sem_waitcommon.c:111 #2 0x00007f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f6d38, abstime=0x0) at sem_waitcommon.c:181 #3 0x00007f8272cf297a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29 #4 0x000055f60d36c252 in PGSemaphoreLock () #5 0x000055f60d3e35cc in LWLockAcquire () #6 0x000055f60d3d14ee in ProcArrayAdd () #7 0x000055f60d3e01a3 in InitProcessPhase2 () #8 0x000055f60d51c99b in InitPostgres () #9 0x000055f60d3f2baf in PostgresMain () #10 0x000055f60d102bf8 in ?? () #11 0x000055f60d37e492 in PostmasterMain () #12 0x000055f60d103fa5 in main () (gdb) cont Continuing. ^C Program received signal SIGINT, Interrupt. 0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38)at ../sysdeps/unix/sysv/linux/futex-internal.h:205 205 in ../sysdeps/unix/sysv/linux/futex-internal.h (gdb) bt #0 0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38)at ../sysdeps/unix/sysv/linux/futex-internal.h:205 #1 do_futex_wait (sem=sem@entry=0x7f7a2e9f6d38, abstime=0x0) at sem_waitcommon.c:111 #2 0x00007f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f6d38, abstime=0x0) at sem_waitcommon.c:181 #3 0x00007f8272cf297a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29 #4 0x000055f60d36c252 in PGSemaphoreLock () #5 0x000055f60d3e35cc in LWLockAcquire () #6 0x000055f60d3d14ee in ProcArrayAdd () #7 0x000055f60d3e01a3 in InitProcessPhase2 () #8 0x000055f60d51c99b in InitPostgres () #9 0x000055f60d3f2baf in PostgresMain () #10 0x000055f60d102bf8 in ?? () #11 0x000055f60d37e492 in PostmasterMain () #12 0x000055f60d103fa5 in main () (gdb) cont Continuing. ^C Program received signal SIGINT, Interrupt. 0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38)at ../sysdeps/unix/sysv/linux/futex-internal.h:205 205 in ../sysdeps/unix/sysv/linux/futex-internal.h (gdb) bt #0 0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38)at ../sysdeps/unix/sysv/linux/futex-internal.h:205 #1 do_futex_wait (sem=sem@entry=0x7f7a2e9f6d38, abstime=0x0) at sem_waitcommon.c:111 #2 0x00007f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f6d38, abstime=0x0) at sem_waitcommon.c:181 #3 0x00007f8272cf297a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29 #4 0x000055f60d36c252 in PGSemaphoreLock () #5 0x000055f60d3e35cc in LWLockAcquire () #6 0x000055f60d3d14ee in ProcArrayAdd () #7 0x000055f60d3e01a3 in InitProcessPhase2 () #8 0x000055f60d51c99b in InitPostgres () #9 0x000055f60d3f2baf in PostgresMain () #10 0x000055f60d102bf8 in ?? () #11 0x000055f60d37e492 in PostmasterMain () #12 0x000055f60d103fa5 in main () (gdb) cont Continuing. ^C Program received signal SIGINT, Interrupt. 
0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38)at ../sysdeps/unix/sysv/linux/futex-internal.h:205 205 in ../sysdeps/unix/sysv/linux/futex-internal.h (gdb) bt #0 0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f6d38)at ../sysdeps/unix/sysv/linux/futex-internal.h:205 #1 do_futex_wait (sem=sem@entry=0x7f7a2e9f6d38, abstime=0x0) at sem_waitcommon.c:111 #2 0x00007f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f6d38, abstime=0x0) at sem_waitcommon.c:181 #3 0x00007f8272cf297a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29 #4 0x000055f60d36c252 in PGSemaphoreLock () #5 0x000055f60d3e35cc in LWLockAcquire () #6 0x000055f60d3d14ee in ProcArrayAdd () #7 0x000055f60d3e01a3 in InitProcessPhase2 () #8 0x000055f60d51c99b in InitPostgres () #9 0x000055f60d3f2baf in PostgresMain () #10 0x000055f60d102bf8 in ?? () #11 0x000055f60d37e492 in PostmasterMain () #12 0x000055f60d103fa5 in main () (gdb) cont Continuing. ^C Program received signal SIGINT, Interrupt. 0x00007f8270a039f3 in delete_module () at ../sysdeps/unix/syscall-template.S:86 86 ../sysdeps/unix/syscall-template.S: No such file or directory. (gdb) bt #0 0x00007f8270a039f3 in delete_module () at ../sysdeps/unix/syscall-template.S:86 #1 0x000055f60d3d0101 in WaitEventSetWait () #2 0x000055f60d2dccff in secure_read () #3 0x000055f60d2e9138 in ?? () #4 0x000055f60d2e9e75 in pq_getbyte () #5 0x000055f60d3f3100 in PostgresMain () #6 0x000055f60d102bf8 in ?? () #7 0x000055f60d37e492 in PostmasterMain () #8 0x000055f60d103fa5 in main () (gdb) cont Continuing. ^C Program received signal SIGINT, Interrupt. 0x00007f8270a039f3 in delete_module () at ../sysdeps/unix/syscall-template.S:86 86 in ../sysdeps/unix/syscall-template.S (gdb) detach Detaching from program: /usr/lib/postgresql/11/bin/postgres, process 79572 (gdb) quit I hope this is useful. Let me know if there are any parameters I should try adjusting, or some other information I should try to extract. Thanks! Best regards, Adam -- "Why do you ask so much?" Adam Sjøgren "I'm a kid. It's what we do." asjo@koldfront.dk
Re: Database logins taking longer and longer, showing up as "authentication" in ps(1)
From: Tom Lane
Adam Sjøgren <asjo@koldfront.dk> writes:
> Tom writes:
>> * Adam Sjøgren (asjo@koldfront.dk) wrote:
>>> Sometimes new database logins slow down, from usually taking <0.05s to
>>> taking minutes. This is for psql as a normal user using Kerberos, for
>>> psql as the postgres superuser, for the web-application logging into the
>>> database, for everything.

>> Also, obtaining stack traces from a few of the stuck processes would
>> likely be quite informative.

> I will try to look at pg_stat_activity and pg_locks the next time - it
> just happened now and I focused on getting some stacktraces -
> unfortunately I do not have the -dbg package installed, so I don't know
> how useful they are.

Frequently, stack traces taken without debug symbols are pretty useless,
but you seem to have got lucky. It seems clear that the unexpected wait
is here:

> (gdb) bt
> #0 0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f1d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
> #1 do_futex_wait (sem=sem@entry=0x7f7a2e9f1d38, abstime=0x0) at sem_waitcommon.c:111
> #2 0x00007f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f1d38, abstime=0x0) at sem_waitcommon.c:181
> #3 0x00007f8272cf297a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29
> #4 0x000055f60d36c252 in PGSemaphoreLock ()
> #5 0x000055f60d3e35cc in LWLockAcquire ()
> #6 0x000055f60d3d14ee in ProcArrayAdd ()
> #7 0x000055f60d3e01a3 in InitProcessPhase2 ()
> #8 0x000055f60d51c99b in InitPostgres ()
> #9 0x000055f60d3f2baf in PostgresMain ()
> #10 0x000055f60d102bf8 in ?? ()
> #11 0x000055f60d37e492 in PostmasterMain ()
> #12 0x000055f60d103fa5 in main ()

which implies that the problem is unexpectedly high contention for the
ProcArrayLock. It's unsurprising that incoming processes would need
that lock, since they need to add themselves to the ProcArray. What's
not so apparent is why there is so much activity as to block them from
doing so for a significant amount of time. We can guess that the
"worker processes" you mentioned are ganging up on that lock because
they need to compute query snapshots, but that still doesn't seem like
enough of an explanation.

Can you show us exactly what the "worker processes" are doing while
this is going on?

			regards, tom lane
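To quantify the contention Tom describes, one option (not shown in the
original thread) is to sample how many backends are waiting on each
event; repeated while the problem is occurring, ProcArrayLock should
dominate the output:

  -- Count backends per wait event; in PostgreSQL 11 the ProcArrayLock
  -- shows up under wait_event_type 'LWLock'.
  SELECT wait_event_type, wait_event, count(*)
    FROM pg_stat_activity
   WHERE wait_event IS NOT NULL
   GROUP BY 1, 2
   ORDER BY count(*) DESC;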
Re: Database logins taking longer and longer, showing up as "authentication" in ps(1)
From: Adam Sjøgren
Tom writes:

> Frequently, stack traces taken without debug symbols are pretty useless,
> but you seem to have got lucky.

For once, hah!

> It seems clear that the unexpected wait is here:
>
>> (gdb) bt
>> #0 0x00007f8272cf2827 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f7a2e9f1d38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
>> #1 do_futex_wait (sem=sem@entry=0x7f7a2e9f1d38, abstime=0x0) at sem_waitcommon.c:111
>> #2 0x00007f8272cf28d4 in __new_sem_wait_slow (sem=0x7f7a2e9f1d38, abstime=0x0) at sem_waitcommon.c:181
>> #3 0x00007f8272cf297a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29
>> #4 0x000055f60d36c252 in PGSemaphoreLock ()
>> #5 0x000055f60d3e35cc in LWLockAcquire ()
>> #6 0x000055f60d3d14ee in ProcArrayAdd ()
>> #7 0x000055f60d3e01a3 in InitProcessPhase2 ()
>> #8 0x000055f60d51c99b in InitPostgres ()
>> #9 0x000055f60d3f2baf in PostgresMain ()
>> #10 0x000055f60d102bf8 in ?? ()
>> #11 0x000055f60d37e492 in PostmasterMain ()
>> #12 0x000055f60d103fa5 in main ()
>
> which implies that the problem is unexpectedly high contention for the
> ProcArrayLock. It's unsurprising that incoming processes would need
> that lock, since they need to add themselves to the ProcArray. What's
> not so apparent is why there is so much activity as to block them
> from doing so for a significant amount of time. We can guess that
> the "worker processes" you mentioned are ganging up on that lock
> because they need to compute query snapshots, but that still doesn't
> seem like enough of an explanation.
>
> Can you show us exactly what the "worker processes" are doing while
> this is going on?

One thing I should have mentioned, but forgot, is that the database is
configured to do logical replication to another machine - could that
have an effect on the lock contention?

A colleague pointed out that in the pg_locks output, the replication
processes are waiting on the ProcArray lock:

 · https://koldfront.dk/misc/postgresql/pg_stat_activity_pg_locks-BAD.txt.gz

I can try turning off replication the next time I see the problem.

The workers basically loop on:

 * SELECT a job from a job-table, UPDATE it
 * Take an advisory lock on the relevant row in the calculations table
 * Do some work, at least a couple of SELECTs from various tables
 * UPDATE or INSERT a row into the calculations table, release the lock
 * INSERT a row in an exitstatus table
 * DELETE the job

The calculations can be sub-second to several minutes, the majority
mostly in the shorter end. As this is happening there are millions of
jobs in the job-table. When the problem occurs, the worker logfiles
indicate that they are moving along, still working.

To give a sense of scale, here are the number of transactions per second
plotted:

 · https://koldfront.dk/misc/postgresql/postgres_transactions_ALL-pinpoint_1597799383-1597907383.png

(I think the hole in the graph is when Munin timed out getting the
information, and the number of transactions dropped as I stopped most of
the workers.)

Best regards,

  Adam

-- 
 "Examination and mastering of a new highly               Adam Sjøgren
  intellectual equipment was a hard labour."          asjo@koldfront.dk
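For reference, a minimal SQL sketch of the worker loop Adam describes
above. The table and column names (jobs, calculations, exitstatus,
calc_id), the SKIP LOCKED clause and the transaction-scoped advisory
lock are illustrative assumptions; the thread does not show the real
code, and :job_id, :calc_id and :result are placeholders:

  BEGIN;

  -- Grab one job; SKIP LOCKED is one common way to let many workers
  -- poll the same queue without blocking each other.
  SELECT id, calc_id
    FROM jobs
   ORDER BY id
   LIMIT 1
     FOR UPDATE SKIP LOCKED;

  UPDATE jobs SET status = 'running' WHERE id = :job_id;

  -- Advisory lock keyed on the calculations row, so that several
  -- in-flight jobs for the same row serialize; the xact-scoped variant
  -- is released automatically at COMMIT.
  SELECT pg_advisory_xact_lock(:calc_id);

  -- ... a couple of SELECTs and the actual calculation happen here ...

  INSERT INTO calculations (id, result)
       VALUES (:calc_id, :result)
  ON CONFLICT (id) DO UPDATE SET result = EXCLUDED.result;

  INSERT INTO exitstatus (job_id, status) VALUES (:job_id, 'ok');

  DELETE FROM jobs WHERE id = :job_id;

  COMMIT;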
Re: Database logins taking longer and longer, showing up as "authentication" in ps(1)
From: Tom Lane
Adam Sjøgren <asjo@koldfront.dk> writes:
> Tom writes:
>> ... which implies that the problem is unexpectedly high contention for the
>> ProcArrayLock.

> One thing I should have mentioned, but forgot, is that the database is
> configured to do logical replication to another machine - could that
> have an effect on the lock contention?

> A colleague pointed out that in the pg_locks output, the replication
> processes are waiting on the ProcArray lock:
> · https://koldfront.dk/misc/postgresql/pg_stat_activity_pg_locks-BAD.txt.gz

Yeah, that is *mighty* interesting. For the archives' sake, I'll point
out that this shows 16 walsender processes, of which 6 are in
WalSenderWaitForWAL waits (ie, basically idle) and all of the other 10
are waiting for the ProcArrayLock. There are also three backends in
ProcArrayGroupUpdate state, confirming that we have a lot of contention
for the ProcArrayLock ... but interestingly, no non-walsender processes
appear to be blocked directly on the ProcArrayLock. In particular, I
*don't* see a backend that could plausibly be the leader of that group
commit; the other sessions showing non-SELECT activity are waiting for
assorted heavyweight locks, and they all look like they're still running
their transactions, not trying to commit. Perhaps this just reflects
the fact that pg_stat_activity isn't an instantaneous snapshot, though?

It's also evident that there's some parallel query activity going on,
as some sessions are in BgWorkerShutdown waits; so parallel worker
entry/exit might be contributing to the overall load on ProcArrayLock.

Anyway, we still don't know more than that the problem is ProcArrayLock
contention; it seems that the walsenders are involved, but we don't
really know if they're causing the problem or are just more victims.

One thing I'm curious about is whether you have walsenders starting or
stopping on a regular basis? The ones visible here seem to all have
been running for at least an hour or so, but in a quick look through the
code it seemed like much of the walsender code that would have any
reason to take the ProcArrayLock is in walsender startup or shutdown.

Another angle that is worth a try is to see whether you can reduce the
number of walsenders, and then see if that ameliorates the issue. I'm
not totally sure but I think we support cascaded replication, so that
you could reduce the replication load on the primary just by
reconfiguring some of the standbys to pull from other ones not directly
from the primary.

			regards, tom lane
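A sketch of how Tom's question about walsender churn could be answered
from the primary (not shown in the original thread); if backend_start
values keep changing between samples, walsenders are starting and
stopping regularly:

  -- One row per walsender currently connected to this primary.
  SELECT pid, application_name, state,
         backend_start, now() - backend_start AS uptime
    FROM pg_stat_replication
   ORDER BY backend_start;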
Re: Database logins taking longer and longer, showing up as "authentication" in ps(1)
From: Adam Sjøgren
Tom writes:

> Adam Sjøgren <asjo@koldfront.dk> writes:
>> Tom writes:
>>> ... which implies that the problem is unexpectedly high contention for the
>>> ProcArrayLock.
>
>> One thing I should have mentioned, but forgot, is that the database is
>> configured to do logical replication to another machine - could that
>> have an effect on the lock contention?
>> A colleague pointed out that in the pg_locks output, the replication
>> processes are waiting on the ProcArray lock:
>> · https://koldfront.dk/misc/postgresql/pg_stat_activity_pg_locks-BAD.txt.gz
>
> Yeah, that is *mighty* interesting. For the archives' sake, I'll point
> out that this shows 16 walsender processes, of which 6 are in
> WalSenderWaitForWAL waits (ie, basically idle) and all of the other 10
> are waiting for the ProcArrayLock.

Sorry for not returning on this before now.

I have seen the slow login problem - ProcArrayLock contention - happen
with replication turned off, so I think that can be ruled out as the
cause.

I just observed something interesting right now, however.

Just to recap the scene: the basic setup is that we have a table with a
queue of jobs to be processed, and we have a number of programs (say 10)
on 6 other machines grabbing those jobs, doing calculations and
reporting back.

The number of jobs at any given time can fluctuate from none to
millions. Typically millions of jobs take some days to a week to get
through.

Now, I happened to have a psql prompt open when the problem of new
logins being slow appeared - and I ran a query that counted the number
of jobs by job-type.

Before the slow login problem, that query took around 70-80 ms, but when
the slow login appeared, it took 20+ seconds to run the query.

EXPLAIN ANALYZE showed that it was running a parallel query.

So I tried turning parallel queries off - with SET
max_parallel_workers_per_gather = 0; - and now the query took around
150 ms consistently, despite logins still being (variably) slow.

So that confirms the ProcArrayLock contention, I think.

My next thought is: I have 6*10 programs making connections and doing
parallel queries over the job queue, to get new jobs. If the jobs happen
to be very "easy", those programs will be doing a lot of queries for
jobs. And when the queue is large enough (I'm guessing), those queries
will be parallel, so they will be creating new processes that need to be
added and removed from the ProcArray.

So am I shooting myself in the foot here, by creating lock contention
when trying to get jobs quickly from a largeish table?

I think my next step will be to modify the programs that grab the jobs
to avoid parallel queries when looking up the next job to run.

You anticipated this, it just took me until now to get a little closer:

> It's also evident that there's some parallel query activity going on,
> as some sessions are in BgWorkerShutdown waits; so parallel worker
> entry/exit might be contributing to the overall load on ProcArrayLock.

Best regards,

  Adam

-- 
 "That's one of the remarkable things about life.         Adam Sjøgren
  It's never so bad that it can't get worse."         asjo@koldfront.dk
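For reference, a few ways to keep just the job-lookup queries from going
parallel without disabling parallelism globally; 'worker_role' and
'jobs' are hypothetical names, not taken from the thread:

  -- Per role: every session logging in as this role gets the setting.
  ALTER ROLE worker_role SET max_parallel_workers_per_gather = 0;

  -- Per table: do not consider parallel workers for scans of the queue
  -- table itself.
  ALTER TABLE jobs SET (parallel_workers = 0);

  -- Per session/query, as in the message above:
  SET max_parallel_workers_per_gather = 0;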
Re: Database logins taking longer and longer, showing up as "authentication" in ps(1)
From: Adrian Klaver
On 9/30/20 1:22 PM, Adam Sjøgren wrote: > Tom writes: > >> =?utf-8?Q?Adam_Sj=C3=B8gren?= <asjo@koldfront.dk> writes: >>> Tom writes: >>>> ... which implies that the problem is unexpectedly high contention for the >>>> ProcArrayLock. >> >>> One thing I should have mentioned, but forgot, is that the database is >>> configured to do logical replication to another machine - could that >>> have an effect on the lock contention? >>> A colleague pointed out that in the pg_locks output, the replication >>> processes are waiting on the ProcArray lock: >>> · https://koldfront.dk/misc/postgresql/pg_stat_activity_pg_locks-BAD.txt.gz >> >> Yeah, that is *mighty* interesting. For the archives' sake, I'll point >> out that this shows 16 walsender processes, of which 6 are in >> WalSenderWaitForWAL waits (ie, basically idle) and all of the other 10 >> are waiting for the ProcArrayLock. > > Sorry for not returning on this before now. > > I have seen the slow login problem - ProcArrayLock contention - happen > with replication turned off, so I think that can be ruled out as the > cause. > > I just observed something interesting right now, however. > > Just to recap the scene: the basic setup is that we have a table with a > queue of jobs to be processed, and we have a number of programs (say 10) > on 6 other machines grabbing those jobs, doing calculations and > reporting back. I don't have an answer. Not even sure if this is relevant to the problem, but how are the jobs getting into the queue? > > The number of jobs at any given time can fluctuate from none to > millions. Typically millions of jobs take some days to a week to get > through. > > Now, I happened to have a psql prompt open when the problem of new > logins being slow appeared - and I ran a query that counted the number > of jobs by job-type. > > Before the slow login problem, that query took around 70-80 ms, but when > the slow login appeared, it took 20+ seconds to run the query. > > EXPLAIN ANALYZE showed that it was running a parallel query. > > So I tried turning parallel queries off - with SET > max_parallel_workers_per_gather = 0; - and now the query took around 150 > ms consistenly, despite logins still being (variably) slow. > > So that confirms the ProcArrayLock contention, I think. > > My next thought is: I have 6*10 programs making connections and doing > parallel queries over the job queue, to get new jobs. If the jobs happen > to be very "easy", those programs will be doing a lot of queries for > jobs. And when the queue is large enough (I'm guessing), those queries > will be parallel, so they will be creating new processes that need to be > added and removed from the ProcArray. > > So am I shooting myself in the foot here, by creating lock contention > when trying to get jobs quickly from a largeish table? > > I think my next step will be to modify the programs that grab the jobs > to avoid parallel queries when looking up the next job to run. > > You anticipated this, it just took me until now to get a little closer: > >> It's also evident that there's some parallel query activity going on, >> as some sessions are in BgWorkerShutdown waits; so parallel worker >> entry/exit might be contributing to the overall load on ProcArrayLock. > > > Best regards, > > Adam > -- Adrian Klaver adrian.klaver@aklaver.com
Re: Database logins taking longer and longer, showing up as "authentication" in ps(1)
From: Adam Sjøgren
Adrian writes:

> I don't have an answer. Not even sure if this is relevant to the
> problem, but how are the jobs getting into the queue?

Plain INSERTs - often a lot at the same time. I inserted 400K jobs to
clean up after a bug earlier today, for instance.

What were you suspecting?

Best regards,

  Adam

-- 
 "Don't think twice                                       Adam Sjøgren
  It's all right"                                     asjo@koldfront.dk
Re: Database logins taking longer and longer, showing up as "authentication" in ps(1)
From: Adrian Klaver
On 9/30/20 2:30 PM, Adam Sjøgren wrote:
> Adrian writes:
>
>> I don't have an answer. Not even sure if this is relevant to the
>> problem, but how are the jobs getting into the queue?
>
> Plain INSERTs - often a lot at the same time. I inserted 400K jobs to
> clean up after a bug earlier today, for instance.
>
> What were you suspecting?

Honestly a fishing expedition. All the discussion, as far as I could
remember, had to do with the retrieval part of the process. Just thought
for completeness it might be useful to flesh out the insertion portion
of the process. Though it does get one to wondering how the appearance
of a large number of jobs at once affects those programs looking for new
jobs?

-- 
Adrian Klaver
adrian.klaver@aklaver.com
Re: Database logins taking longer and longer, showing up as "authentication" in ps(1)
From: Michael Lewis
On Wed, Sep 30, 2020 at 3:41 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
> On 9/30/20 2:30 PM, Adam Sjøgren wrote:
>> Adrian writes:
>>
>>> I don't have an answer. Not even sure if this is relevant to the
>>> problem, but how are the jobs getting into the queue?
>>
>> Plain INSERTs - often a lot at the same time. I inserted 400K jobs to
>> clean up after a bug earlier today, for instance.
>>
>> What were you suspecting?
>
> Honestly a fishing expedition. All the discussion, as far as I could
> remember, had to do with the retrieval part of the process. Just thought
> for completeness it might be useful to flesh out the insertion portion
> of the process. Though it does get one to wondering how the appearance
> of a large number of jobs at once affects those programs looking for
> new jobs?
Is autovacuum/analyze keeping up with the changes to allow the planner to make prudent choices? With so many updates/deletes, I would hope so but if still using the 20% or 10% default values for scale_factor, and at times the table has many millions of rows, perhaps not.
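For a queue table whose row count swings between zero and millions, the
percentage-based defaults Michael mentions can indeed lag. One option is
fixed per-table thresholds; a sketch with made-up numbers and the
hypothetical 'jobs' table name:

  -- Vacuum/analyze after a fixed number of changed rows rather than a
  -- percentage of the (sometimes huge) table.
  ALTER TABLE jobs SET (
      autovacuum_vacuum_scale_factor  = 0.0,
      autovacuum_vacuum_threshold     = 50000,
      autovacuum_analyze_scale_factor = 0.0,
      autovacuum_analyze_threshold    = 50000
  );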