G'day all,
A quick follow-up on this issue for interest's sake. The stalling we
were seeing turned out to be a Cloud SQL issue and not related to our
listen/notify usage.
Cloud SQL has an automatic storage increase process that resizes the
underlying disk as required to account for cluster growth. As it turns
out, that process occasionally causes I/O to stall for a brief window.
https://cloud.google.com/sql/docs/postgres/instance-settings#automatic-storage-increase-2ndgen
The workaround supplied by Google is to manually provision slack
storage in larger increments, so that the automatic increases (which
happen 25GB at a time on a large cluster) are needed less often.
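For anyone wanting to do the same, bumping the provisioned storage is a
single gcloud call - from memory it's something along the lines of the
below, but do check the current gcloud docs for the exact flag:

    gcloud sql instances patch INSTANCE_NAME --storage-size=500GB

(500GB is just an illustrative figure; the idea is to leave enough
headroom that the automatic 25GB bumps rarely kick in.)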
We didn't make the connection ourselves because disk resize events
aren't visible in any logs we can see; Google Support found the issue
by correlating the timestamps of our observed outages with their
internal logs.
Hopefully this is useful for someone else. Thanks again for your help,
Tom - your advice on listen/notify locking on commit was very useful
even though it turned out not to be the cause in this case.
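One small aside for anyone else on managed Postgres without filesystem
access: you can at least watch notify queue pressure from SQL with

    SELECT pg_notification_queue_usage();

which returns the fraction of the notification queue currently in use -
not quite the same as watching pg_notify on disk, but better than
nothing.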
Cheers
Ben
On Mon, 1 Feb 2021 at 12:33, Ben Hoskings <ben@hoskings.net> wrote:
>
> On Mon, 1 Feb 2021 at 10:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > One thing that just occurred to me is that you might find it
> > interesting to keep tabs on what's in the $PGDATA/pg_notify
> > directory. Do the performance burps correspond to transitory
> > peaks in the amount of data there? Or (grasping at straws here...)
> > wraparound of the file names back to 0000?
>
> We don't have filesystem access on Cloud SQL - the downside of the
> managed route :)
>
> It sounds like it might be time to bump the pg13 upgrade up the TODO list.
>
> Cheers
> Ben