Thread: Auto-vacuum timing out and preventing connections

Auto-vacuum timing out and preventing connections

From

David Johansen

Date:

27 June 2022, 20:38:21

We're running into an issue where the database can't be connected to. It appears that the auto-vacuum is timing out and then that prevents new connections from happening. This assumption is based on these logs showing up in the logs:

WARNING: worker took too long to start; canceled

The log appears about every 5 minutes and eventually nothing can connect to it and it has to be rebooted.

These are the most similarly related previous posts, but the CPU usage isn't high when this happens, so I don't believe that's the problem

https://www.postgresql.org/message-id/20081105185206.GS4114%40alvh.no-ip.org

https://www.postgresql.org/message-id/AANLkTinsGLeRc26RT5Kb4_HEhow5e97p0ZBveg=p9xqS@mail.gmail.com

What can we do to diagnose this problem and get our database working reliably again?

Thanks,

Dave

Re: Auto-vacuum timing out and preventing connections

From

Julien Rouhaud

Date:

28 June 2022, 04:42:39

Hi,

On Mon, Jun 27, 2022 at 02:38:21PM -0600, David Johansen wrote:
> We're running into an issue where the database can't be connected to. It
> appears that the auto-vacuum is timing out and then that prevents new
> connections from happening. This assumption is based on these logs showing
> up in the logs:
> WARNING:  worker took too long to start; canceled

I don't think that autovacuum is the reason of the problem, but just another
victim of the same problem as the autovacuum launcher is still active and tries
to schedule workers, which can't connect either.

> The log appears about every 5 minutes and eventually nothing can connect to
> it and it has to be rebooted.

Are you saying that you have to reboot every 5 minutes?

Also, do you mean reboot the server or just restarting the postgres service is
enough?

> These are the most similarly related previous posts, but the CPU usage
> isn't high when this happens, so I don't believe that's the problem
> https://www.postgresql.org/message-id/20081105185206.GS4114%40alvh.no-ip.org
> https://www.postgresql.org/message-id/AANLkTinsGLeRc26RT5Kb4_HEhow5e97p0ZBveg=p9xqS@mail.gmail.com
>
> What can we do to diagnose this problem and get our database working
> reliably again?

As mentioned in the 2nd link, getting a strace of the postmaster when the
problem happens may help.

Re: Auto-vacuum timing out and preventing connections

From

David Johansen

Date:

28 June 2022, 17:24:23

On Mon, Jun 27, 2022 at 10:42 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

Hi,

On Mon, Jun 27, 2022 at 02:38:21PM -0600, David Johansen wrote:
> We're running into an issue where the database can't be connected to. It
> appears that the auto-vacuum is timing out and then that prevents new
> connections from happening. This assumption is based on these logs showing
> up in the logs:
> WARNING: worker took too long to start; canceled

I don't think that autovacuum is the reason of the problem, but just another
victim of the same problem as the autovacuum launcher is still active and tries
to schedule workers, which can't connect either.

Sorry, I should have provided some more details. These logs happen for 12-24 hours before the server stops accepting connections.

> The log appears about every 5 minutes and eventually nothing can connect to
> it and it has to be rebooted.

Are you saying that you have to reboot every 5 minutes?

That error log happens every 5 minutes and that's the nap time.

Also, do you mean reboot the server or just restarting the postgres service is
enough?

Restarting the postgres service.

> These are the most similarly related previous posts, but the CPU usage
> isn't high when this happens, so I don't believe that's the problem
> https://www.postgresql.org/message-id/20081105185206.GS4114%40alvh.no-ip.org
> https://www.postgresql.org/message-id/AANLkTinsGLeRc26RT5Kb4_HEhow5e97p0ZBveg=p9xqS@mail.gmail.com
>
> What can we do to diagnose this problem and get our database working
> reliably again?

As mentioned in the 2nd link, getting a strace of the postmaster when the
problem happens may help.

This is running in RDS on AWS, so I don't believe I can do an strace on the service.

Re: Auto-vacuum timing out and preventing connections

From

Jeff Janes

Date:

28 June 2022, 19:31:14

On Mon, Jun 27, 2022 at 4:38 PM David Johansen <davejohansen@gmail.com> wrote:

We're running into an issue where the database can't be connected to. It appears that the auto-vacuum is timing out and then that prevents new connections from happening. This assumption is based on these logs showing up in the logs:
WARNING: worker took too long to start; canceled
The log appears about every 5 minutes and eventually nothing can connect to it and it has to be rebooted.

As Julien suggested, this sounds like another victim, not the cause. Is there anything else in the log files?

What version are you using?

These are the most similarly related previous posts, but the CPU usage isn't high when this happens, so I don't believe that's the problem
https://www.postgresql.org/message-id/20081105185206.GS4114%40alvh.no-ip.org
https://www.postgresql.org/message-id/AANLkTinsGLeRc26RT5Kb4_HEhow5e97p0ZBveg=p9xqS@mail.gmail.com

But, I don't see high CPU described as a symptom in either of those threads.

If you can't reproduce the problem locally, there probably isn't much we can do. Maybe ask Amazon to look into it, since they are the only ones with sufficient access to do so.

Cheers,

Jeff

Re: Auto-vacuum timing out and preventing connections

From

David Johansen

Date:

28 June 2022, 20:05:31

On Tue, Jun 28, 2022 at 1:31 PM Jeff Janes <jeff.janes@gmail.com> wrote:

On Mon, Jun 27, 2022 at 4:38 PM David Johansen <davejohansen@gmail.com> wrote:
We're running into an issue where the database can't be connected to. It appears that the auto-vacuum is timing out and then that prevents new connections from happening. This assumption is based on these logs showing up in the logs:
WARNING: worker took too long to start; canceled
The log appears about every 5 minutes and eventually nothing can connect to it and it has to be rebooted.

As Julien suggested, this sounds like another victim, not the cause. Is there anything else in the log files?

That's the only thing in the logs for the 12-24 hours before the database becomes inaccessible.

What version are you using?

13.6

These are the most similarly related previous posts, but the CPU usage isn't high when this happens, so I don't believe that's the problem
https://www.postgresql.org/message-id/20081105185206.GS4114%40alvh.no-ip.org
https://www.postgresql.org/message-id/AANLkTinsGLeRc26RT5Kb4_HEhow5e97p0ZBveg=p9xqS@mail.gmail.com

But, I don't see high CPU described as a symptom in either of those threads.

I was referring to the "I've seen this happen under heavy load" statement. Not sure that's the cause or related in those posts, but it doesn't appear to be the issue here.

If you can't reproduce the problem locally, there probably isn't much we can do. Maybe ask Amazon to look into it, since they are the only ones with sufficient access to do so.

We've opened a support case, but I was trying to be proactive and seeing what we could dig into on our end. Is there a way to tell which table the auto-vacuum is trying to run on and timing out with?

Re: Auto-vacuum timing out and preventing connections

From

Masahiko Sawada

Date:

30 June 2022, 02:22:35

On Wed, Jun 29, 2022 at 5:05 AM David Johansen <davejohansen@gmail.com> wrote:
>
> On Tue, Jun 28, 2022 at 1:31 PM Jeff Janes <jeff.janes@gmail.com> wrote:
>>
>> On Mon, Jun 27, 2022 at 4:38 PM David Johansen <davejohansen@gmail.com> wrote:
>>>
>>> We're running into an issue where the database can't be connected to. It appears that the auto-vacuum is timing out
andthen that prevents new connections from happening. This assumption is based on these logs showing up in the logs:
 
>>> WARNING:  worker took too long to start; canceled
>>> The log appears about every 5 minutes and eventually nothing can connect to it and it has to be rebooted.
>>
>>
>> As Julien suggested, this sounds like another victim, not the cause.  Is there anything else in the log files?
>
>
> That's the only thing in the logs for the 12-24 hours before the database becomes inaccessible.
>
>>
>> What version are you using?
>
>
> 13.6
>
>>>
>>> These are the most similarly related previous posts, but the CPU usage isn't high when this happens, so I don't
believethat's the problem
 
>>> https://www.postgresql.org/message-id/20081105185206.GS4114%40alvh.no-ip.org
>>> https://www.postgresql.org/message-id/AANLkTinsGLeRc26RT5Kb4_HEhow5e97p0ZBveg=p9xqS@mail.gmail.com
>>
>>
>> But, I don't see high CPU described as a symptom in either of those threads.
>
>
> I was referring to the "I've seen this happen under heavy load" statement. Not sure that's the cause or related in
thoseposts, but it doesn't appear to be the issue here.
 
>
>>
>>  If you can't reproduce the problem locally, there probably isn't much we can do.  Maybe ask Amazon to look into it,
sincethey are the only ones with sufficient access to do so.
 
>
>
> We've opened a support case, but I was trying to be proactive and seeing what we could dig into on our end. Is there
away to tell which table the auto-vacuum is trying to run on and timing out with?
 

Autovacuum workers launch per database. The situation where the
warning "worker took too long to start; canceled" occurs is that an
autovacuum worker for the particular database took a long time for its
startup phase (initializing and setting parameters etc.). There is no
way to know neither which table nor which database.

If it's reproducible, it may help investigate it if you could collect
the contents of pg_stat_activity when the issue is happening in order
to see if there is another process waiting.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: Auto-vacuum timing out and preventing connections

From

Andres Freund

Date:

02 July 2022, 03:11:39

Hi,

On 2022-06-27 14:38:21 -0600, David Johansen wrote:
> We're running into an issue where the database can't be connected to. It
> appears that the auto-vacuum is timing out and then that prevents new
> connections from happening. This assumption is based on these logs showing
> up in the logs:
> WARNING:  worker took too long to start; canceled
> The log appears about every 5 minutes and eventually nothing can connect to
> it and it has to be rebooted.

How many databases, tables and functions do you have?


> What can we do to diagnose this problem and get our database working
> reliably again?

Downthread you said you're on RDS - unfortunately you're going to have quite
limited access and are going to have to take it up with AWS.

Greetings,

Andres Freund

Re: Auto-vacuum timing out and preventing connections

From

David Johansen

Date:

02 July 2022, 04:34:13

On Fri, Jul 1, 2022 at 9:11 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2022-06-27 14:38:21 -0600, David Johansen wrote:
> We're running into an issue where the database can't be connected to. It
> appears that the auto-vacuum is timing out and then that prevents new
> connections from happening. This assumption is based on these logs showing
> up in the logs:
> WARNING: worker took too long to start; canceled
> The log appears about every 5 minutes and eventually nothing can connect to
> it and it has to be rebooted.

How many databases, tables and functions do you have?

1 database

427 tables

1 function

> What can we do to diagnose this problem and get our database working
> reliably again?

Downthread you said you're on RDS - unfortunately you're going to have quite
limited access and are going to have to take it up with AWS.

Yes, I definitely understand that AWS will likely need to be involved or resolve this issue, but as I said previously, I'm trying to be proactive and just see what I could find.

Re: Auto-vacuum timing out and preventing connections

From

"Xiong He"

Date:

04 July 2022, 04:29:38

You can get the auto vacuum activity from pg_stat view. try to fond which table causes the timeout. and even more you csn disable autovacuum to check whether it still happens.

---Original---

From: "David Johansen"<davejohansen@gmail.com>

Date: Sat, Jul 2, 2022 12:34 PM

To: "Andres Freund"<andres@anarazel.de>;

Cc: "pgsql-bugs"<pgsql-bugs@lists.postgresql.org>;

Subject: Re: Auto-vacuum timing out and preventing connections

On Fri, Jul 1, 2022 at 9:11 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2022-06-27 14:38:21 -0600, David Johansen wrote:
> We're running into an issue where the database can't be connected to. It
> appears that the auto-vacuum is timing out and then that prevents new
> connections from happening. This assumption is based on these logs showing
> up in the logs:
> WARNING: worker took too long to start; canceled
> The log appears about every 5 minutes and eventually nothing can connect to
> it and it has to be rebooted.

How many databases, tables and functions do you have?

1 database

427 tables

1 function

> What can we do to diagnose this problem and get our database working
> reliably again?

Downthread you said you're on RDS - unfortunately you're going to have quite
limited access and are going to have to take it up with AWS.

Yes, I definitely understand that AWS will likely need to be involved or resolve this issue, but as I said previously, I'm trying to be proactive and just see what I could find.

Re: Auto-vacuum timing out and preventing connections

From

David Johansen

Date:

14 July 2022, 16:51:39

On Tue, Jun 28, 2022 at 2:05 PM David Johansen <davejohansen@gmail.com> wrote:

On Tue, Jun 28, 2022 at 1:31 PM Jeff Janes <jeff.janes@gmail.com> wrote:
On Mon, Jun 27, 2022 at 4:38 PM David Johansen <davejohansen@gmail.com> wrote:
We're running into an issue where the database can't be connected to. It appears that the auto-vacuum is timing out and then that prevents new connections from happening. This assumption is based on these logs showing up in the logs:
WARNING: worker took too long to start; canceled
The log appears about every 5 minutes and eventually nothing can connect to it and it has to be rebooted.

As Julien suggested, this sounds like another victim, not the cause. Is there anything else in the log files?

That's the only thing in the logs for the 12-24 hours before the database becomes inaccessible.

To follow up on this, this was the symptom and not the cause. The auto-vacuum was failing to start because of a bug and not the cause of the problem.

Re: Auto-vacuum timing out and preventing connections

From

Andres Freund

Date:

15 July 2022, 03:42:18

Hi,

On 2022-07-14 10:51:39 -0600, David Johansen wrote:
> On Tue, Jun 28, 2022 at 2:05 PM David Johansen <davejohansen@gmail.com>
> wrote:
> 
> > On Tue, Jun 28, 2022 at 1:31 PM Jeff Janes <jeff.janes@gmail.com> wrote:
> >
> >> On Mon, Jun 27, 2022 at 4:38 PM David Johansen <davejohansen@gmail.com>
> >> wrote:
> >>
> >>> We're running into an issue where the database can't be connected to. It
> >>> appears that the auto-vacuum is timing out and then that prevents new
> >>> connections from happening. This assumption is based on these logs showing
> >>> up in the logs:
> >>> WARNING:  worker took too long to start; canceled
> >>> The log appears about every 5 minutes and eventually nothing can connect
> >>> to it and it has to be rebooted.
> >>>
> >>
> >> As Julien suggested, this sounds like another victim, not the cause.  Is
> >> there anything else in the log files?
> >>
> >
> > That's the only thing in the logs for the 12-24 hours before the database
> > becomes inaccessible.
> >
> 
> To follow up on this, this was the symptom and not the cause. The
> auto-vacuum was failing to start because of a bug and not the cause of the
> problem.

What bug?

Greetings,

Andres Freund

Re: Auto-vacuum timing out and preventing connections

From

David Johansen

Date:

15 July 2022, 04:22:43

On Thu, Jul 14, 2022 at 9:42 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2022-07-14 10:51:39 -0600, David Johansen wrote:
> On Tue, Jun 28, 2022 at 2:05 PM David Johansen <davejohansen@gmail.com>
> wrote:
>
> > On Tue, Jun 28, 2022 at 1:31 PM Jeff Janes <jeff.janes@gmail.com> wrote:
> >
> >> On Mon, Jun 27, 2022 at 4:38 PM David Johansen <davejohansen@gmail.com>
> >> wrote:
> >>
> >>> We're running into an issue where the database can't be connected to. It
> >>> appears that the auto-vacuum is timing out and then that prevents new
> >>> connections from happening. This assumption is based on these logs showing
> >>> up in the logs:
> >>> WARNING: worker took too long to start; canceled
> >>> The log appears about every 5 minutes and eventually nothing can connect
> >>> to it and it has to be rebooted.
> >>>
> >>
> >> As Julien suggested, this sounds like another victim, not the cause. Is
> >> there anything else in the log files?
> >>
> >
> > That's the only thing in the logs for the 12-24 hours before the database
> > becomes inaccessible.
> >
>
> To follow up on this, this was the symptom and not the cause. The
> auto-vacuum was failing to start because of a bug and not the cause of the
> problem.

What bug?

It appears to have been related to the scaling and process management that Aurora Serverless V2 does. I haven't been able to find any info posted about this issue from AWS, but we opened a support case and were told the following:

We have identified a critical stability update for Aurora PostgreSQL Serverless v2 instances running versions 13.6, 13.7, and 14.3. We have also identified a critical issue in Aurora PostgreSQL Serverless v2 clusters running versions 13.7 and 14.3. These issues can cause database restarts or failovers under specific conditions. We have developed fixes and are deploying the fixes in two patches. The patches will be automatically applied to the affected instances and clusters in upcoming maintenance windows over the next 3 weeks causing two restarts of your database. One patch will show as a security and stability update and one patch will show as a database update. They will be scheduled sequentially.

The symptoms we observed were slightly different than what is described above, but we manually applied the patches as soon as they were available and haven't noticed the problem since.