Thread: BUG #13895: Hot Standby locked replaying auto vacuum against pg_catalog.pg_statistics

BUG #13895: Hot Standby locked replaying auto vacuum against pg_catalog.pg_statistics

From: dimitri@2ndQuadrant.fr
The following bug has been logged on the website:

Bug reference:      13895
Logged by:          Dimitri Fontaine
Email address:      dimitri@2ndQuadrant.fr
PostgreSQL version: 9.3.10
Operating system:   Linux
Description:

I witnessed a case where I had no time to extract information about it, so
it's all from memory, sorry about that.

We had several Hot Standby nodes in 9.3.x used for load balancing read only
queries. All queries were stuck, and pg_locks showed they were refused a
lock against pg_catalog.pg_statistics.

The lock owner PID was not to be found in the pg_stat_activity system view,
for it was the replay process.

I don't remember how I got to the conclusion it was replaying specifically
autovacuum activity, but if that's helpful, that is what I remember it was
about.

My todo list includes extracting useful information next time it happens,
before we just restart the standby (a restart fixes the availability
situation).
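
For next time, a query along these lines (an untested sketch) should show which relation locks are held by a pid that has no row in pg_stat_activity -- on a hot standby that is normally the startup/replay process -- together with the sessions queued behind them:

    -- Sketch: granted relation locks whose holder is not in pg_stat_activity
    -- (on a hot standby that is typically the startup/replay process).
    SELECT l.locktype, l.relation::regclass AS relation, l.mode, l.pid
    FROM pg_locks l
    LEFT JOIN pg_stat_activity a ON a.pid = l.pid
    WHERE l.granted
      AND l.relation IS NOT NULL
      AND a.pid IS NULL;

    -- And the queries stuck waiting for locks:
    SELECT a.pid, w.relation::regclass AS relation, w.mode, a.query
    FROM pg_locks w
    JOIN pg_stat_activity a ON a.pid = w.pid
    WHERE NOT w.granted;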

Regards,
dimitri@2ndQuadrant.fr wrote:

>
> I witnessed a case where I had no time to extract information about it, so
> it's all from memory, sorry about that.
>
> We had several Hot Standby nodes in 9.3.x used for load balancing read only
> queries. All queries were stuck, and pg_locks showed they were refused a
> lock against pg_catalog.pg_statistics.

My WAG is that this is related to the standby replaying the
AccessExclusive lock obtained to truncate pg_statistics.  That seems
consistent with the presented symptoms.

I would have assumed that the actual truncation shouldn't take a
particularly long time, so user processes on the standby wouldn't be
locked for long enough to cause visible trouble.  Maybe the table was
originally very bloated and a lot of pages got removed, leading the
replay process to take some time.  It is strange that it would be locked
for long enough to allow for manually querying pg_locks and
pg_stat_activity.
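
As an aside, a rough way to gauge that kind of bloat on the primary is something like the sketch below, using the pgstattuple contrib extension (it scans the whole table, so run it with that in mind; note the catalog's actual name is pg_statistic, without the trailing "s"):

    -- Sketch: how much of pg_statistic is dead/free space right now.
    CREATE EXTENSION IF NOT EXISTS pgstattuple;
    SELECT pg_size_pretty(pg_relation_size('pg_catalog.pg_statistic')) AS on_disk,
           pg_size_pretty(free_space) AS free_space,
           dead_tuple_count
    FROM pgstattuple('pg_catalog.pg_statistic');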

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> replay process to take some time.  It is strange that it would be locked
> for long enough to allow for manually querying pg_locks and
> pg_stat_activity.

We restarted the standby after maybe 35 mins of application downtime due
to the incident… the diagnostic and restart did happen a tad late in the
process on that occasion, as it sometimes happens.

So the lock was held for about that long.
--
Dimitri Fontaine                                        06 63 07 10 78
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support
On Thu, Jan 28, 2016 at 11:53 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:

> dimitri@2ndQuadrant.fr wrote:
>
> >
> > I witnessed a case where I had no time to extract information about it, so
> > it's all from memory, sorry about that.
> >
> > We had several Hot Standby nodes in 9.3.x used for load balancing read only
> > queries. All queries were stuck, and pg_locks showed they were refused a
> > lock against pg_catalog.pg_statistics.
>
> My WAG is that this is related to the standby replaying the
> AccessExclusive lock obtained to truncate pg_statistics.  That seems
> consistent with the presented symptoms.
>
> I would have assumed that the actual truncation shouldn't take a
> particularly long time, so user processes on the standby wouldn't be
> locked for long enough to cause visible trouble.  Maybe the table was
> originally very bloated and a lot of pages got removed, leading the
> replay process to take some time.  It is strange that it would be locked
> for long enough to allow for manually querying pg_locks and
> pg_stat_activity.
>
> --
> Álvaro Herrera                http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

If pg_statistics had been seriously bloated (say a few GB of empty space at
the end), and at the same time the database sat on a mechanical (and busy)
HDD, the vacuum truncation could take a really long time on replicas.
I have seen it a few times with different application tables during
pg_compactable runs.

The reason is simple: the empty space at the end of the table is not used by
anything, so it stays cold (outside of shared buffers and the filesystem
cache).
I am not sure (I haven't checked the actual code), but I have a feeling that
the replay process (or the OS) ends up reading every empty page before the
truncation, all the while holding AccessExclusive, and that can take a lot
of time.

I think it might be useful to investigate what exact disk operations are
performed during the truncation scan, and to try to move some of the work
outside of the AccessExclusive block.
(Again I am not sure, but it seems that the truncated (and cold) part of the
relation is read backward, page by page from the end of the file, and that
access pattern has abysmal performance on mechanical storage.)


--
Maxim Boguk
Senior Postgresql DBA
http://www.postgresql-consulting.com/

Phone RU: +7 910 405 4718
Phone AU: +61 45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk
Jabber: maxim.boguk@gmail.com

"People problems are solved with people.
If people cannot solve the problem, try technology.
People will then wish they'd listened at the first stage."
We saw this at Heroku a few times, one of which was posted here iirc.

I think the problem isn't that the truncate takes a long time. It's that it
deadlocks against the same tuple cleanups, possibly on the same table. It
was always the statistics table there too.

Recovery holding locks makes sense, and recovery pausing to avoid cleaning
tuples that are still needed is fine, but the two can deadlock against each
other. We could represent tuple cleanup with a lock when we need to block,
which would at least let the deadlock detector notice, but I'm not sure what
it should do since it can't kill recovery...
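
Incidentally, the standby's recovery-conflict counters are a quick way to see whether queries have already been cancelled because of replayed locks or deadlocks:

    -- On the standby: per-database counts of queries cancelled by recovery conflicts.
    SELECT datname, confl_lock, confl_snapshot, confl_bufferpin, confl_deadlock
    FROM pg_stat_database_conflicts
    WHERE datname NOT LIKE 'template%';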

You can mitigate the problem by raising the vacuum cleanup age
(vacuum_defer_cleanup_age) or using standby feedback (hot_standby_feedback).
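
A minimal sketch of checking those knobs from psql (the values below are illustrative, not recommendations; on 9.3 they are changed in postgresql.conf):

    SHOW vacuum_defer_cleanup_age;     -- primary: defer cleanup of recently dead rows
    SHOW hot_standby_feedback;         -- standby: report oldest xmin back to the primary
    SHOW max_standby_streaming_delay;  -- standby: how long replay waits on conflicting queries
    -- Illustrative postgresql.conf settings (assumption: tune to your workload):
    --   vacuum_defer_cleanup_age = 100000
    --   hot_standby_feedback = on
    --   max_standby_streaming_delay = 5min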

--
Greg
On 28 Jan 2016 1:43 pm, <dimitri@2ndquadrant.fr> wrote:

> The following bug has been logged on the website:
>
> Bug reference:      13895
> Logged by:          Dimitri Fontaine
> Email address:      dimitri@2ndQuadrant.fr
> PostgreSQL version: 9.3.10
> Operating system:   Linux
> Description:
>
> I witnessed a case where I had no time to extract information about it, so
> it's all from memory, sorry about that.
>
> We had several Hot Standby nodes in 9.3.x used for load balancing read only
> queries. All queries were stuck, and pg_locks showed they were refused a
> lock against pg_catalog.pg_statistics.
>
> The lock owner PID was not to be found in the pg_stat_activity system view,
> for it was the replay process.
>
> I don't remember how I got to the conclusion it was replaying specifically
> autovacuum activity, but if that's helpful, that is what I remember it was
> about.
>
> My todo list includes extracting useful information next time it happens,
> before we just restart the standby (a restart fixes the availability
> situation).
>
> Regards,