Re: Auto-vacuum timing out and preventing connections - Mailing list pgsql-bugs

From David Johansen
Subject Re: Auto-vacuum timing out and preventing connections
Date
Msg-id CAAcYxUcKYdWBRXUah_QNDTi4BfzsyPBw6KTC054PTWZeYLmNew@mail.gmail.com
In response to Re: Auto-vacuum timing out and preventing connections  (Andres Freund <andres@anarazel.de>)
List pgsql-bugs
On Thu, Jul 14, 2022 at 9:42 PM Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2022-07-14 10:51:39 -0600, David Johansen wrote:
> > On Tue, Jun 28, 2022 at 2:05 PM David Johansen <davejohansen@gmail.com>
> > wrote:
> >
> > > On Tue, Jun 28, 2022 at 1:31 PM Jeff Janes <jeff.janes@gmail.com> wrote:
> > >
> > >> On Mon, Jun 27, 2022 at 4:38 PM David Johansen <davejohansen@gmail.com>
> > >> wrote:
> > >>
> > >>> We're running into an issue where the database can't be connected to. It
> > >>> appears that the auto-vacuum is timing out and then that prevents new
> > >>> connections from happening. This assumption is based on these logs showing
> > >>> up in the logs:
> > >>> WARNING:  worker took too long to start; canceled
> > >>> The log appears about every 5 minutes and eventually nothing can connect
> > >>> to it and it has to be rebooted.
> > >>>
> > >>
> > >> As Julien suggested, this sounds like another victim, not the cause.  Is
> > >> there anything else in the log files?
> > >>
> > >
> > > That's the only thing in the logs for the 12-24 hours before the database
> > > becomes inaccessible.
> > >
> >
> > To follow up on this, this was the symptom and not the cause. The
> > auto-vacuum was failing to start because of a bug and not the cause of the
> > problem.
>
> What bug?

The bug appears to have been related to the scaling and process management that Aurora Serverless v2 does. I haven't been able to find anything AWS has published about this issue, but we opened a support case and were told the following:

We have identified a critical stability update for Aurora PostgreSQL Serverless v2 instances running versions 13.6, 13.7, and 14.3. We have also identified a critical issue in Aurora PostgreSQL Serverless v2 clusters running versions 13.7 and 14.3. These issues can cause database restarts or failovers under specific conditions. We have developed fixes and are deploying the fixes in two patches. The patches will be automatically applied to the affected instances and clusters in upcoming maintenance windows over the next 3 weeks causing two restarts of your database. One patch will show as a security and stability update and one patch will show as a database update. They will be scheduled sequentially.

The symptoms we observed were slightly different from those described above, but we manually applied the patches as soon as they were available and haven't seen the problem since.
