Re: Autovacuum worker doesn't immediately exit on postmaster death - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: Autovacuum worker doesn't immediately exit on postmaster death |
Date | |
Msg-id | CA+hUKG+pHf0NXxAJAs=wZW5cpMUw++gmdb+e=cAy0vEoN9hB8w@mail.gmail.com |
In response to | Re: Autovacuum worker doesn't immediately exit on postmaster death (Robert Haas <robertmhaas@gmail.com>) |
List | pgsql-hackers |
On Fri, Dec 11, 2020 at 8:34 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Oct 29, 2020 at 5:36 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > Maybe instead of thinking specifically in terms of vacuum, we could
> > count buffer accesses (read from kernel) and check the latch once every
> > 1000th such, or something like that. Then a very long query doesn't
> > have to wait until it's run to completion. The cost is one integer
> > addition per syscall, which should be bearable.
>
> Interesting idea. One related case is where everything is fine on the
> server side but the client has disconnected and we don't notice that
> the socket has changed state until something makes us try to send a
> message to the client, which might be a really long time if the
> server's doing like a lengthy computation before generating any rows.
> It would be really nice if we could find a cheap way to check for both
> postmaster death and client disconnect every now and then, like if a
> single system call could somehow answer both questions.

For the record, an alternative approach was proposed[1] that
periodically checks for disconnected sockets using a timer, which then
causes the next CFI() to abort. Doing the check (a syscall) based on
elapsed time rather than every nth CFI() or buffer access or whatever
seems better in some ways, considering the difficulty of knowing what
the frequency will be. One of the objections was that it added
unacceptable setitimer() calls. We discussed an idea to solve that
problem generally, and then later I prototyped that idea in another
thread[2] about idle session timeouts (not sure about that yet,
comments welcome).

I've also wondered about checking postmaster_possibly_dead in CFI() on
platforms where we have it (and working to increase that set of
platforms), instead of just reacting to PM death when sleeping. But it
seems like the real problem in this specific case is the use of
pg_usleep() where WaitLatch() should be used, no?
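As a rough illustration of the timer-driven idea described above (this is not code from the referenced patch): a signal handler only sets a flag, and a CFI()-like hot path pays for the expensive probe only after that flag has been set. Every name here (`check_pending`, `peer_is_gone`, `install_check_timer_handler`, `check_for_interrupts_sketch`) is hypothetical, and the probe is a stub standing in for whatever syscall would really detect a dead client or postmaster. Arming the timer itself, for example with setitimer(), is left out.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Set by the signal handler; the only state the hot path has to read. */
static volatile sig_atomic_t check_pending = 0;

/* Counts how often the expensive probe actually ran (illustration only). */
int probes_run = 0;

/* Hypothetical stand-in for the real probe, e.g. polling the client socket. */
bool
peer_is_gone(void)
{
    probes_run++;
    return false;               /* the stub always reports "still there" */
}

/* Signal handler: do nothing except request a check. */
void
timer_fired(int signo)
{
    (void) signo;
    check_pending = 1;
}

/* Install the SIGALRM handler; the caller is expected to arm a timer. */
void
install_check_timer_handler(void)
{
    void (*prev) (int) = signal(SIGALRM, timer_fired);

    assert(prev != SIG_ERR);
    (void) prev;
}

/*
 * Hypothetical analogue of CFI(): normally just a flag test. The probe
 * runs only if the timer fired since the last call. Returns true if the
 * caller should abort its work.
 */
bool
check_for_interrupts_sketch(void)
{
    if (!check_pending)
        return false;
    check_pending = 0;
    return peer_is_gone();
}
```

The point of the split is that the check frequency is set by elapsed time rather than by how often the hot path happens to be reached, which is the property argued for above.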
The recovery loop is at the opposite end of the spectrum: while vacuum
doesn't check for postmaster death often enough, the recovery loop
checks potentially hundreds of thousands or millions of times per
second, which sucks on systems that don't have parent-death signals
and slows down recovery quite measurably. In the course of the
discussion about fixing that[3] we spotted other places that are using
pg_usleep() where they ought to be using WaitLatch() (which comes
with exit-on-PM-death behaviour built-in).

By the way, the patch in that thread does almost what Robert described,
namely check for PM death every nth time (which in this case means
every nth WAL record), except it's not in the main CFI(), it's in a
special variant used just for recovery.

[1] https://www.postgresql.org/message-id/flat/77def86b27e41f0efcba411460e929ae%40postgrespro.ru
[2] https://www.postgresql.org/message-id/flat/763A0689-F189-459E-946F-F0EC4458980B@hotmail.com
[3] https://www.postgresql.org/message-id/flat/CA+hUKGK1607VmtrDUHQXrsooU=ap4g4R2yaoByWOOA3m8xevUQ@mail.gmail.com
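For comparison, the "check every nth record" pattern mentioned for the recovery loop could look roughly like the sketch below. It is not taken from the patch in [3]; the interval `PM_CHECK_INTERVAL`, the stub `parent_is_alive()`, and the function `record_loop_should_exit()` are all made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical interval: run the expensive check once per this many calls. */
#define PM_CHECK_INTERVAL 1024

static_assert(PM_CHECK_INTERVAL > 0, "interval must be positive");

/* Counts how often the expensive check actually ran (illustration only). */
int expensive_checks = 0;

/* Hypothetical stand-in for a syscall-based "is the parent alive?" test. */
bool
parent_is_alive(void)
{
    expensive_checks++;
    return true;                /* the stub always reports "alive" */
}

/*
 * Called once per processed record. Most calls cost one increment and
 * one comparison; only every PM_CHECK_INTERVAL-th call runs the
 * expensive check. Returns true if the loop should exit.
 */
bool
record_loop_should_exit(void)
{
    static uint32_t calls = 0;

    if (++calls % PM_CHECK_INTERVAL != 0)
        return false;
    return !parent_is_alive();
}
```

The trade-off is the one raised earlier in the thread: the counter is nearly free, but how long n records take depends entirely on the workload, so the delay before noticing postmaster death is not bounded in time.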