Re: Slow shutdowns sometimes on RDS Postgres - Mailing list pgsql-general

From Jeremy Schneider
Subject Re: Slow shutdowns sometimes on RDS Postgres
Date
Msg-id 4c65f988-a5ee-53f1-2d58-476a5c244cd2@amazon.com
In response to Re: Slow shutdowns sometimes on RDS Postgres  (Christophe Pettus <xof@thebuild.com>)
Responses Re: Slow shutdowns sometimes on RDS Postgres  (Adrian Klaver <adrian.klaver@aklaver.com>)
List pgsql-general
On 9/14/18 10:04, Christophe Pettus wrote:
In our experience, it's actually quite common that an RDS shutdown (or even just applying parameter changes) can take a while. What's particularly concerning is that it's not predictable, and that can make it hard to schedule and manage maintenance windows. What we were told previously is that RDS queues the operations, and it can take a variable amount of time for the operation to be worked on from the queue. Is that not the case?

Thanks Christophe - even if it's not what Chris is running into, this is another good call-out.

It's important to distinguish here between the RDS parts and the community PostgreSQL parts.  I think for this thread it's just worth pointing out that RDS automation/tooling will report the database in a "modifying" state until it completes its management operations; however, the actual database unavailability is much shorter.  RDS carefully engineers its processes to minimize the actual database unavailability itself.

Chris has run into a problem where the PostgreSQL processes did not shut down, evidenced by the error messages he mentioned, and as a result his database was actually unavailable to applications for an extended period.  This is uncommon and concerning.

This isn't the right forum for discussing the RDS bits; let's take that to the AWS forums.  These operations are not synchronous, but the time to complete should absolutely be predictable within reasonable bounds depending on the operation type. I don't know how anyone could use the platform otherwise!  If anyone is unable to establish bounded expectations for some automated operation, I'd strongly encourage starting a thread on the AWS forums or opening a support ticket.


On 9/14/18 09:27, Adrian Klaver wrote:
The thing is I do not remember any posts to this list mentioning the same problem on a platform outside RDS. A quick search seems to confirm that.

I've met folks from other large fleet operators at PG conferences.  There are all kinds of stories we don't find on the lists yet.  :)  Hopefully we're all getting better about closing the loop and sharing stuff back - that's part of the value large fleet operators can and should bring to the community.

I don't know about this specific incident, but I do know that the RDS
team has seen cases where a backend gets into a state (like a system
call) where it's not checking signals and thus doesn't receive or
process the postmaster's request to quit. We've seen these processes
delay shutdowns and also block recovery on streaming replicas.

The particulars of that state?

For the cases I've heard about, we haven't yet caught things quickly enough to get stack dumps.  So I don't think we have particulars yet.

-Jeremy

-- 
Jeremy Schneider
Database Engineer
Amazon Web Services
