Re: Auto-vacuum timing out and preventing connections - Mailing list pgsql-bugs

From	David Johansen
Subject	Re: Auto-vacuum timing out and preventing connections
Date	July 15, 2022 07:22:43
Msg-id	CAAcYxUcKYdWBRXUah_QNDTi4BfzsyPBw6KTC054PTWZeYLmNew@mail.gmail.com
In response to	Re: Auto-vacuum timing out and preventing connections (Andres Freund <andres@anarazel.de>)
List	pgsql-bugs

On Thu, Jul 14, 2022 at 9:42 PM Andres Freund <andres@anarazel.de> wrote:

> Hi,
>
> On 2022-07-14 10:51:39 -0600, David Johansen wrote:
> > On Tue, Jun 28, 2022 at 2:05 PM David Johansen <davejohansen@gmail.com>
> > wrote:
> >
> > > On Tue, Jun 28, 2022 at 1:31 PM Jeff Janes <jeff.janes@gmail.com> wrote:
> > >
> > >> On Mon, Jun 27, 2022 at 4:38 PM David Johansen <davejohansen@gmail.com>
> > >> wrote:
> > >>
> > >>> We're running into an issue where the database can't be connected to. It
> > >>> appears that the auto-vacuum is timing out and then that prevents new
> > >>> connections from happening. This assumption is based on these logs showing
> > >>> up in the logs:
> > >>> WARNING: worker took too long to start; canceled
> > >>> The log appears about every 5 minutes and eventually nothing can connect
> > >>> to it and it has to be rebooted.
> > >>>
> > >>
> > >> As Julien suggested, this sounds like another victim, not the cause. Is
> > >> there anything else in the log files?
> > >>
> > >
> > > That's the only thing in the logs for the 12-24 hours before the database
> > > becomes inaccessible.
> > >
> >
> > To follow up on this, this was the symptom and not the cause. The
> > auto-vacuum was failing to start because of a bug and not the cause of the
> > problem.
>
> What bug?

The bug appears to have been related to the scaling and process management that Aurora Serverless v2 performs. I haven't been able to find any public information from AWS about this issue, but we opened a support case and were told the following:

We have identified a critical stability update for Aurora PostgreSQL Serverless v2 instances running versions 13.6, 13.7, and 14.3. We have also identified a critical issue in Aurora PostgreSQL Serverless v2 clusters running versions 13.7 and 14.3. These issues can cause database restarts or failovers under specific conditions. We have developed fixes and are deploying the fixes in two patches. The patches will be automatically applied to the affected instances and clusters in upcoming maintenance windows over the next 3 weeks causing two restarts of your database. One patch will show as a security and stability update and one patch will show as a database update. They will be scheduled sequentially.

The symptoms we observed were slightly different from those described above, but we manually applied the patches as soon as they were available and haven't noticed the problem since.
