Thread: Auto-vacuum timing out and preventing connections
We're running into an issue where the database can't be connected to. It appears that the auto-vacuum is timing out and then that prevents new connections from happening. This assumption is based on these logs showing up in the logs:
WARNING: worker took too long to start; canceled
The log appears about every 5 minutes and eventually nothing can connect to it and it has to be rebooted.
These are the most similarly related previous posts, but the CPU usage isn't high when this happens, so I don't believe that's the problem
What can we do to diagnose this problem and get our database working reliably again?
Thanks,
Dave
Hi, On Mon, Jun 27, 2022 at 02:38:21PM -0600, David Johansen wrote: > We're running into an issue where the database can't be connected to. It > appears that the auto-vacuum is timing out and then that prevents new > connections from happening. This assumption is based on these logs showing > up in the logs: > WARNING: worker took too long to start; canceled I don't think that autovacuum is the reason of the problem, but just another victim of the same problem as the autovacuum launcher is still active and tries to schedule workers, which can't connect either. > The log appears about every 5 minutes and eventually nothing can connect to > it and it has to be rebooted. Are you saying that you have to reboot every 5 minutes? Also, do you mean reboot the server or just restarting the postgres service is enough? > These are the most similarly related previous posts, but the CPU usage > isn't high when this happens, so I don't believe that's the problem > https://www.postgresql.org/message-id/20081105185206.GS4114%40alvh.no-ip.org > https://www.postgresql.org/message-id/AANLkTinsGLeRc26RT5Kb4_HEhow5e97p0ZBveg=p9xqS@mail.gmail.com > > What can we do to diagnose this problem and get our database working > reliably again? As mentioned in the 2nd link, getting a strace of the postmaster when the problem happens may help.
On Mon, Jun 27, 2022 at 10:42 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
Hi,
On Mon, Jun 27, 2022 at 02:38:21PM -0600, David Johansen wrote:
> We're running into an issue where the database can't be connected to. It
> appears that the auto-vacuum is timing out and then that prevents new
> connections from happening. This assumption is based on these logs showing
> up in the logs:
> WARNING: worker took too long to start; canceled
I don't think that autovacuum is the reason of the problem, but just another
victim of the same problem as the autovacuum launcher is still active and tries
to schedule workers, which can't connect either.
Sorry, I should have provided some more details. These logs happen for 12-24 hours before the server stops accepting connections.
> The log appears about every 5 minutes and eventually nothing can connect to
> it and it has to be rebooted.
Are you saying that you have to reboot every 5 minutes?
That error log happens every 5 minutes and that's the nap time.
Also, do you mean reboot the server or just restarting the postgres service is
enough?
Restarting the postgres service.
> These are the most similarly related previous posts, but the CPU usage
> isn't high when this happens, so I don't believe that's the problem
> https://www.postgresql.org/message-id/20081105185206.GS4114%40alvh.no-ip.org
> https://www.postgresql.org/message-id/AANLkTinsGLeRc26RT5Kb4_HEhow5e97p0ZBveg=p9xqS@mail.gmail.com
>
> What can we do to diagnose this problem and get our database working
> reliably again?
As mentioned in the 2nd link, getting a strace of the postmaster when the
problem happens may help.
This is running in RDS on AWS, so I don't believe I can do an strace on the service.
On Mon, Jun 27, 2022 at 4:38 PM David Johansen <davejohansen@gmail.com> wrote:
We're running into an issue where the database can't be connected to. It appears that the auto-vacuum is timing out and then that prevents new connections from happening. This assumption is based on these logs showing up in the logs:WARNING: worker took too long to start; canceledThe log appears about every 5 minutes and eventually nothing can connect to it and it has to be rebooted.
As Julien suggested, this sounds like another victim, not the cause. Is there anything else in the log files?
What version are you using?
These are the most similarly related previous posts, but the CPU usage isn't high when this happens, so I don't believe that's the problem
But, I don't see high CPU described as a symptom in either of those threads.
If you can't reproduce the problem locally, there probably isn't much we can do. Maybe ask Amazon to look into it, since they are the only ones with sufficient access to do so.
Cheers,
Jeff
On Tue, Jun 28, 2022 at 1:31 PM Jeff Janes <jeff.janes@gmail.com> wrote:
On Mon, Jun 27, 2022 at 4:38 PM David Johansen <davejohansen@gmail.com> wrote:We're running into an issue where the database can't be connected to. It appears that the auto-vacuum is timing out and then that prevents new connections from happening. This assumption is based on these logs showing up in the logs:WARNING: worker took too long to start; canceledThe log appears about every 5 minutes and eventually nothing can connect to it and it has to be rebooted.As Julien suggested, this sounds like another victim, not the cause. Is there anything else in the log files?
That's the only thing in the logs for the 12-24 hours before the database becomes inaccessible.
What version are you using?
13.6
These are the most similarly related previous posts, but the CPU usage isn't high when this happens, so I don't believe that's the problemBut, I don't see high CPU described as a symptom in either of those threads.
I was referring to the "I've seen this happen under heavy load" statement. Not sure that's the cause or related in those posts, but it doesn't appear to be the issue here.
If you can't reproduce the problem locally, there probably isn't much we can do. Maybe ask Amazon to look into it, since they are the only ones with sufficient access to do so.
We've opened a support case, but I was trying to be proactive and seeing what we could dig into on our end. Is there a way to tell which table the auto-vacuum is trying to run on and timing out with?
On Wed, Jun 29, 2022 at 5:05 AM David Johansen <davejohansen@gmail.com> wrote: > > On Tue, Jun 28, 2022 at 1:31 PM Jeff Janes <jeff.janes@gmail.com> wrote: >> >> On Mon, Jun 27, 2022 at 4:38 PM David Johansen <davejohansen@gmail.com> wrote: >>> >>> We're running into an issue where the database can't be connected to. It appears that the auto-vacuum is timing out andthen that prevents new connections from happening. This assumption is based on these logs showing up in the logs: >>> WARNING: worker took too long to start; canceled >>> The log appears about every 5 minutes and eventually nothing can connect to it and it has to be rebooted. >> >> >> As Julien suggested, this sounds like another victim, not the cause. Is there anything else in the log files? > > > That's the only thing in the logs for the 12-24 hours before the database becomes inaccessible. > >> >> What version are you using? > > > 13.6 > >>> >>> These are the most similarly related previous posts, but the CPU usage isn't high when this happens, so I don't believethat's the problem >>> https://www.postgresql.org/message-id/20081105185206.GS4114%40alvh.no-ip.org >>> https://www.postgresql.org/message-id/AANLkTinsGLeRc26RT5Kb4_HEhow5e97p0ZBveg=p9xqS@mail.gmail.com >> >> >> But, I don't see high CPU described as a symptom in either of those threads. > > > I was referring to the "I've seen this happen under heavy load" statement. Not sure that's the cause or related in thoseposts, but it doesn't appear to be the issue here. > >> >> If you can't reproduce the problem locally, there probably isn't much we can do. Maybe ask Amazon to look into it, sincethey are the only ones with sufficient access to do so. > > > We've opened a support case, but I was trying to be proactive and seeing what we could dig into on our end. Is there away to tell which table the auto-vacuum is trying to run on and timing out with? Autovacuum workers launch per database. The situation where the warning "worker took too long to start; canceled" occurs is that an autovacuum worker for the particular database took a long time for its startup phase (initializing and setting parameters etc.). There is no way to know neither which table nor which database. If it's reproducible, it may help investigate it if you could collect the contents of pg_stat_activity when the issue is happening in order to see if there is another process waiting. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Hi, On 2022-06-27 14:38:21 -0600, David Johansen wrote: > We're running into an issue where the database can't be connected to. It > appears that the auto-vacuum is timing out and then that prevents new > connections from happening. This assumption is based on these logs showing > up in the logs: > WARNING: worker took too long to start; canceled > The log appears about every 5 minutes and eventually nothing can connect to > it and it has to be rebooted. How many databases, tables and functions do you have? > What can we do to diagnose this problem and get our database working > reliably again? Downthread you said you're on RDS - unfortunately you're going to have quite limited access and are going to have to take it up with AWS. Greetings, Andres Freund
On Fri, Jul 1, 2022 at 9:11 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-06-27 14:38:21 -0600, David Johansen wrote:
> We're running into an issue where the database can't be connected to. It
> appears that the auto-vacuum is timing out and then that prevents new
> connections from happening. This assumption is based on these logs showing
> up in the logs:
> WARNING: worker took too long to start; canceled
> The log appears about every 5 minutes and eventually nothing can connect to
> it and it has to be rebooted.
How many databases, tables and functions do you have?
1 database
427 tables
1 function
> What can we do to diagnose this problem and get our database working
> reliably again?
Downthread you said you're on RDS - unfortunately you're going to have quite
limited access and are going to have to take it up with AWS.
Yes, I definitely understand that AWS will likely need to be involved or resolve this issue, but as I said previously, I'm trying to be proactive and just see what I could find.
You can get the auto vacuum activity from pg_stat view. try to fond which table causes the timeout. and even more you csn disable autovacuum to check whether it still happens.
---Original---
From: "David Johansen"<davejohansen@gmail.com>
Date: Sat, Jul 2, 2022 12:34 PM
To: "Andres Freund"<andres@anarazel.de>;
Cc: "pgsql-bugs"<pgsql-bugs@lists.postgresql.org>;
Subject: Re: Auto-vacuum timing out and preventing connections
On Fri, Jul 1, 2022 at 9:11 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-06-27 14:38:21 -0600, David Johansen wrote:
> We're running into an issue where the database can't be connected to. It
> appears that the auto-vacuum is timing out and then that prevents new
> connections from happening. This assumption is based on these logs showing
> up in the logs:
> WARNING: worker took too long to start; canceled
> The log appears about every 5 minutes and eventually nothing can connect to
> it and it has to be rebooted.
How many databases, tables and functions do you have?
1 database
427 tables
1 function
> What can we do to diagnose this problem and get our database working
> reliably again?
Downthread you said you're on RDS - unfortunately you're going to have quite
limited access and are going to have to take it up with AWS.
Yes, I definitely understand that AWS will likely need to be involved or resolve this issue, but as I said previously, I'm trying to be proactive and just see what I could find.
On Tue, Jun 28, 2022 at 2:05 PM David Johansen <davejohansen@gmail.com> wrote:
On Tue, Jun 28, 2022 at 1:31 PM Jeff Janes <jeff.janes@gmail.com> wrote:On Mon, Jun 27, 2022 at 4:38 PM David Johansen <davejohansen@gmail.com> wrote:We're running into an issue where the database can't be connected to. It appears that the auto-vacuum is timing out and then that prevents new connections from happening. This assumption is based on these logs showing up in the logs:WARNING: worker took too long to start; canceledThe log appears about every 5 minutes and eventually nothing can connect to it and it has to be rebooted.As Julien suggested, this sounds like another victim, not the cause. Is there anything else in the log files?That's the only thing in the logs for the 12-24 hours before the database becomes inaccessible.
To follow up on this, this was the symptom and not the cause. The auto-vacuum was failing to start because of a bug and not the cause of the problem.
Hi, On 2022-07-14 10:51:39 -0600, David Johansen wrote: > On Tue, Jun 28, 2022 at 2:05 PM David Johansen <davejohansen@gmail.com> > wrote: > > > On Tue, Jun 28, 2022 at 1:31 PM Jeff Janes <jeff.janes@gmail.com> wrote: > > > >> On Mon, Jun 27, 2022 at 4:38 PM David Johansen <davejohansen@gmail.com> > >> wrote: > >> > >>> We're running into an issue where the database can't be connected to. It > >>> appears that the auto-vacuum is timing out and then that prevents new > >>> connections from happening. This assumption is based on these logs showing > >>> up in the logs: > >>> WARNING: worker took too long to start; canceled > >>> The log appears about every 5 minutes and eventually nothing can connect > >>> to it and it has to be rebooted. > >>> > >> > >> As Julien suggested, this sounds like another victim, not the cause. Is > >> there anything else in the log files? > >> > > > > That's the only thing in the logs for the 12-24 hours before the database > > becomes inaccessible. > > > > To follow up on this, this was the symptom and not the cause. The > auto-vacuum was failing to start because of a bug and not the cause of the > problem. What bug? Greetings, Andres Freund
On Thu, Jul 14, 2022 at 9:42 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-07-14 10:51:39 -0600, David Johansen wrote:
> On Tue, Jun 28, 2022 at 2:05 PM David Johansen <davejohansen@gmail.com>
> wrote:
>
> > On Tue, Jun 28, 2022 at 1:31 PM Jeff Janes <jeff.janes@gmail.com> wrote:
> >
> >> On Mon, Jun 27, 2022 at 4:38 PM David Johansen <davejohansen@gmail.com>
> >> wrote:
> >>
> >>> We're running into an issue where the database can't be connected to. It
> >>> appears that the auto-vacuum is timing out and then that prevents new
> >>> connections from happening. This assumption is based on these logs showing
> >>> up in the logs:
> >>> WARNING: worker took too long to start; canceled
> >>> The log appears about every 5 minutes and eventually nothing can connect
> >>> to it and it has to be rebooted.
> >>>
> >>
> >> As Julien suggested, this sounds like another victim, not the cause. Is
> >> there anything else in the log files?
> >>
> >
> > That's the only thing in the logs for the 12-24 hours before the database
> > becomes inaccessible.
> >
>
> To follow up on this, this was the symptom and not the cause. The
> auto-vacuum was failing to start because of a bug and not the cause of the
> problem.
What bug?
It appears to have been related to the scaling and process management that Aurora Serverless V2 does. I haven't been able to find any info posted about this issue from AWS, but we opened a support case and were told the following:
The symptoms we observed were slightly different than what is described above, but we manually applied the patches as soon as they were available and haven't noticed the problem since.