Thread: BUG #13895: Hot Standby locked replaying auto vacuum against pg_catalog.pg_statistics
BUG #13895: Hot Standby locked replaying auto vacuum against pg_catalog.pg_statistics
From
dimitri@2ndQuadrant.fr
Date:
The following bug has been logged on the website:

Bug reference:      13895
Logged by:          Dimitri Fontaine
Email address:      dimitri@2ndQuadrant.fr
PostgreSQL version: 9.3.10
Operating system:   Linux
Description:

I witnessed a case where I had no time to extract information about it, so it's all from memory, sorry about that.

We had several Hot Standby nodes in 9.3.x used for load balancing read-only queries. All queries were stuck, and pg_locks showed they were refused a lock against pg_catalog.pg_statistics.

The lock owner PID was not to be found in the pg_stat_activity system view, for it was the replay process.

I don't remember how I got to the conclusion that it was replaying specifically autovacuum activity, but if that's helpful, that is what I remember it was about.

My todo list includes extracting useful information next time it happens, before we just restart the standby (a restart fixes the availability situation).

Regards,
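The pg_locks observation above can be reproduced with a query along these lines (a hedged sketch, not from the thread; note the catalog's actual name is pg_statistic, and on pre-9.6 servers there is no pg_blocking_pids(), so a manual join is needed). A lock-holding PID with no row in pg_stat_activity on a standby is typically the startup (WAL replay) process:

```sql
-- List lock entries on pg_statistic and whether the holding/waiting
-- PID is a regular backend; the replay process holds locks but has
-- no pg_stat_activity entry.
SELECT l.pid,
       l.mode,
       l.granted,
       (a.pid IS NOT NULL) AS is_regular_backend
FROM pg_locks l
LEFT JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.relation = 'pg_catalog.pg_statistic'::regclass
ORDER BY l.granted DESC;
```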
Re: BUG #13895: Hot Standby locked replaying auto vacuum against pg_catalog.pg_statistics
From
Alvaro Herrera
Date:
dimitri@2ndQuadrant.fr wrote:

> I witnessed a case where I had no time to extract information about it, so
> it's all from memory, sorry about that.
>
> We had several Hot Standby nodes in 9.3.x used for load balancing read only
> queries. All queries were stuck, and pg_locks showed they were refused a
> lock against pg_catalog.pg_statistics.

My WAG is that this is related to the standby replaying the AccessExclusive lock obtained to truncate pg_statistics. That seems consistent with the presented symptoms.

I would have assumed that the actual truncation shouldn't take a particularly long time, so user processes on the standby wouldn't be locked for long enough to cause visible trouble. Maybe the table was originally very bloated and a lot of pages got removed, leading the replay process to take some time. It is strange that it would be locked for long enough to allow for manually querying pg_locks and pg_stat_activity.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
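The bloat hypothesis could be checked on the primary with something like the following (a sketch, not from the thread; pgstattuple is a contrib extension, and the numbers only approximate how much trailing space a VACUUM truncation could reclaim):

```sql
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Total size versus free space: a free_space that is large relative
-- to the relation size suggests many empty pages that a VACUUM
-- truncation may try to remove.
SELECT pg_relation_size('pg_catalog.pg_statistic') AS total_bytes,
       (pgstattuple('pg_catalog.pg_statistic')).free_space AS free_bytes;
```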
Re: BUG #13895: Hot Standby locked replaying auto vacuum against pg_catalog.pg_statistics
From
Dimitri Fontaine
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> replay process to take some time. It is strange that it would be locked
> for long enough to allow for manually querying pg_locks and
> pg_stat_activity.

We restarted the standby after maybe 35 mins of application downtime due to the incident… the diagnostic and restart did happen a tad late in the process on that occasion, as it sometimes happens. So the lock was kept for about that long.

--
Dimitri Fontaine
06 63 07 10 78            http://2ndQuadrant.fr
PostgreSQL : Expertise, Formation et Support
Re: BUG #13895: Hot Standby locked replaying auto vacuum against pg_catalog.pg_statistics
From
Maxim Boguk
Date:
On Thu, Jan 28, 2016 at 11:53 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> My WAG is that this is related to the standby replaying the
> AccessExclusive lock obtained to truncate pg_statistics. That seems
> consistent with the presented symptoms.
>
> I would have assumed that the actual truncation shouldn't take a
> particularly long time, so user processes on the standby wouldn't be
> locked for long enough to cause visible trouble. Maybe the table was
> originally very bloated and a lot of pages got removed, leading the
> replay process to take some time. It is strange that it would be locked
> for long enough to allow for manually querying pg_locks and
> pg_stat_activity.

If pg_statistics had been seriously bloated (say, a few GB of empty space at the end) and, at the same time, the database sat on a mechanical (and busy) HDD, the vacuum truncation could take a really long time on the replicas. I have seen this a few times with different application tables during pg_compactable runs. The reason is simple: the empty space at the end of the table is not used by anything, so it stays cold (outside of shared buffers and the filesystem cache).
I'm not sure (I didn't check the actual code), but I have a feeling that the replay process (or the OS) reads every empty page before the truncation, all the while holding the AccessExclusive lock, and that this can take a lot of time. It might be useful to investigate exactly which disk operations are performed during the truncation scan, and to try to move some of that work outside the AccessExclusive block. (Again, I'm not sure, but it seems the truncated (and cold) part of the relation is read backward, page by page, from the end of the file; that access pattern has abysmal performance on mechanical storage.)

--
Maxim Boguk
Senior Postgresql DBA
http://www.postgresql-consulting.com/

Phone RU: +7 910 405 4718
Phone AU: +61 45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk
Jabber: maxim.boguk@gmail.com

"People problems are solved with people. If people cannot solve the problem, try technology. People will then wish they'd listened at the first stage."
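The access pattern speculated about above — walking the file backward one 8 kB page at a time — can be illustrated with a small stand-alone sketch (plain Python, not the actual PostgreSQL truncation code; the dummy file and page count are made up for the example). Each read seeks to a lower offset than the previous one, which defeats OS readahead on spinning disks:

```python
import os
import tempfile

PAGE = 8192  # PostgreSQL's default block size

def backward_page_scan(path, nblocks):
    """Read pages from the last block down to block 0, one page at a
    time, mimicking a truncation scan that searches backward for the
    last non-empty page."""
    order = []
    with open(path, "rb") as f:
        for blk in range(nblocks - 1, -1, -1):
            f.seek(blk * PAGE)   # every seek moves *backward* in the file
            f.read(PAGE)
            order.append(blk)
    return order

# Build a dummy 16-page "relation" consisting entirely of empty pages.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00" * PAGE * 16)
    path = tmp.name

print(backward_page_scan(path, 16)[:4])  # [15, 14, 13, 12]
os.remove(path)
```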
Re: BUG #13895: Hot Standby locked replaying auto vacuum against pg_catalog.pg_statistics
From
Greg Stark
Date:
We saw this at Heroku a few times, one of which was posted here iirc.

I think the problem isn't that the truncate takes a long time. It's that it deadlocks against the same tuple cleanups, possibly on the same table. It was always the statistics table there too.

Recovery holding locks makes sense, and recovery pausing to avoid cleaning tuples that are still needed is fine, but the two can deadlock against each other. We could represent tuple cleanup with a lock when we need to block, which would at least let the deadlock detector notice, but I'm not sure what it should do since it can't kill recovery...

You can mitigate the problem by raising the vacuum cleanup age or using feedback.

--
Greg

On 28 Jan 2016 1:43 pm, <dimitri@2ndquadrant.fr> wrote:
> We had several Hot Standby nodes in 9.3.x used for load balancing read only
> queries. All queries were stuck, and pg_locks showed they were refused a
> lock against pg_catalog.pg_statistics.
>
> The lock owner PID was not to be found in the pg_stat_activity system view,
> for it was the replay process.
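The mitigations mentioned above map onto standard configuration parameters (a hedged sketch of settings to consider, not the values used in this incident):

```
# On each standby: report which tuples standby queries still need, so
# VACUUM on the primary does not create recovery conflicts.
hot_standby_feedback = on

# Or, on the primary: defer cleanup of dead tuples by N transactions.
vacuum_defer_cleanup_age = 100000

# On the standby: bound how long replay may wait behind conflicting queries.
max_standby_streaming_delay = 30s
```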