Thread: Recovery conflict monitoring
This patch adds counters and views to monitor hot standby generated recovery conflicts. It extends the pg_stat_database view with one column with the total number of conflicts, and also creates a new view pg_stat_database_conflicts that contains a breakdown of exactly what caused the conflicts. Documentation still pending, but comments meanwhile is of course appreciated ;) -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Attachment
On Thu, Dec 23, 2010 at 13:09, Magnus Hagander <magnus@hagander.net> wrote: > This patch adds counters and views to monitor hot standby generated > recovery conflicts. It extends the pg_stat_database view with one > column with the total number of conflicts, and also creates a new view > pg_stat_database_conflicts that contains a breakdown of exactly what > caused the conflicts. > > Documentation still pending, but comments meanwhile is of course appreciated ;) Heikki pointed out over IM that it's pointless to count stats caused by recovery conflict with drop database - since we drop the stats record as soon as it arrives anyway. Here's an updated patch that removes that, and also adds some documentation. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Attachment
On Mon, 2010-12-27 at 14:39 +0100, Magnus Hagander wrote: > On Thu, Dec 23, 2010 at 13:09, Magnus Hagander <magnus@hagander.net> wrote: > > This patch adds counters and views to monitor hot standby generated > > recovery conflicts. It extends the pg_stat_database view with one > > column with the total number of conflicts, and also creates a new view > > pg_stat_database_conflicts that contains a breakdown of exactly what > > caused the conflicts. > > > > Documentation still pending, but comments meanwhile is of course appreciated ;) > > Heikki pointed out over IM that it's pointless to count stats caused > by recovery conflict with drop database - since we drop the stats > record as soon as it arrives anyway. Here's an updated patch that > removes that, and also adds some documentation. I like the patch, well inspired, code in the right places AFAICS. No code comments at all. Couple of thoughts: * are we safe to issue stats immediately before issuing FATAL? Won't some of them get lost? * Not clear what I'd do with database level information, except worry a lot. Maybe an option to count conflicts per user would be better, since at least we'd know exactly who was affected by those. Just an idea. * Would it better to have a log_standby_conflicts that allowed the opportunity to log the conflicting SQL, duration until cancelation etc? I'd rather have what you have than nothing at all though... the new hot_standby_feedback mode should be acting to reduce these, so it would be useful to have this patch enabled for testing that feature. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On Mon, Jan 3, 2011 at 00:23, Simon Riggs <simon@2ndquadrant.com> wrote: > On Mon, 2010-12-27 at 14:39 +0100, Magnus Hagander wrote: >> On Thu, Dec 23, 2010 at 13:09, Magnus Hagander <magnus@hagander.net> wrote: >> > This patch adds counters and views to monitor hot standby generated >> > recovery conflicts. It extends the pg_stat_database view with one >> > column with the total number of conflicts, and also creates a new view >> > pg_stat_database_conflicts that contains a breakdown of exactly what >> > caused the conflicts. >> > >> > Documentation still pending, but comments meanwhile is of course appreciated ;) >> >> Heikki pointed out over IM that it's pointless to count stats caused >> by recovery conflict with drop database - since we drop the stats >> record as soon as it arrives anyway. Here's an updated patch that >> removes that, and also adds some documentation. > > I like the patch, well inspired, code in the right places AFAICS. No > code comments at all. Thanks for reviewing! > Couple of thoughts: > > * are we safe to issue stats immediately before issuing FATAL? Won't > some of them get lost? They shouldn't - not more than other stats messages. Those are often flushed from on_shmem_exit() which I think runs even later. > * Not clear what I'd do with database level information, except worry a > lot. Maybe an option to count conflicts per user would be better, since > at least we'd know exactly who was affected by those. Just an idea. Depends on the usage scenario. In a lot of dedicated environments you really only have one database - but there are many environments where you do have multiple and it's quite useful to see them separately. And you can of course very easily sum() them up for a total count, since it's a view... Better keep the detail than throw it away, even if that part isn't useful in *all* cases... Grouping by user would potentially be helpful - I agree. However, that goes for most pgstat counters ("number of seqscans", "tuples read" etc are interesting per user as well in some cases). So doing that right would mean adding per-user tracking all across pgstats in some smart way - something we don't do now at all. So I see that as a separate issue. > * Would it better to have a log_standby_conflicts that allowed the > opportunity to log the conflicting SQL, duration until cancelation etc? Logging is useful to figure out why you have a certain scenario, yes. But absolutely not as a *replacement* for the statistics counters, but as an addition. Just like we have (the now incorrectly named) pg_stat_bgwriter view *and* log_checkpoints... Different usecases for the same basic information. > I'd rather have what you have than nothing at all though... the new > hot_standby_feedback mode should be acting to reduce these, so it would > be useful to have this patch enabled for testing that feature. It will help reduce it, but not take it away, right? Plus, it's an optional feature... -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
On Mon, 2011-01-03 at 10:03 +0100, Magnus Hagander wrote: > > I like the patch, well inspired, code in the right places AFAICS. No > > code comments at all. > > Thanks for reviewing! All good here. Test and commit please. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
Couple of doc suggestions: --- doc/src/sgml/high-availability.sgml + The number of query cancels and the reason for them can be viewed using + the <structname>pg_stat_database_conflicts</> system view on the slave + server. For compleness sake, this should also mention the per-database summary, even though I'm not sure how valuable that view is. Also, "on a standby server" instead of "on the slave server" here. "slave" is mentioned once as a synonym in high-availability.sgml once, but that's it, and there can be more than one standby you want to pull these stats from. *** doc/src/sgml/monitoring.sgml ! number of rows returned, fetched, inserted, updated and deleted, and ! total number of queries cancelled due to conflict with recovery. This would be clearer if it said you're talking about standby recovery here, and possibly that this info is only available on the standby. I could see someone reading this and thinking it's possible for general database crash recovery to produce cancelled queries, instead of the way connections are actually blocked until that's done. ! <entry><structname>pg_stat_database_conflicts</> ! <entry>One row per database, showing database OID, database name and ! the number of queries that have been cancelled in this database due to ! dropped tablespaces, lock timeouts, old snapshots, pinned buffers and ! deadlocks. A clarification that you're talking about standby query cancellation here might be helpful too. I don't think that's necessary for all of the detailed pg_stat_get_* functions that regular users are less likely to care about, just these higher level ones. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On Mon, Jan 3, 2011 at 11:35, Greg Smith <greg@2ndquadrant.com> wrote: > Couple of doc suggestions: > > --- doc/src/sgml/high-availability.sgml > + The number of query cancels and the reason for them can be viewed > using > + the <structname>pg_stat_database_conflicts</> system view on the slave > + server. > > For compleness sake, this should also mention the per-database summary, even > though I'm not sure how valuable that view is. Also, "on a standby server" > instead of "on the slave server" here. "slave" is mentioned once as a > synonym in high-availability.sgml once, but that's it, and there can be more > than one standby you want to pull these stats from. Good point, changed and added. > *** doc/src/sgml/monitoring.sgml > ! number of rows returned, fetched, inserted, updated and deleted, and > ! total number of queries cancelled due to conflict with recovery. > > This would be clearer if it said you're talking about standby recovery here, > and possibly that this info is only available on the standby. I could see > someone reading this and thinking it's possible for general database crash > recovery to produce cancelled queries, instead of the way connections are > actually blocked until that's done. > > ! <entry><structname>pg_stat_database_conflicts</> > ! <entry>One row per database, showing database OID, database name and > ! the number of queries that have been cancelled in this database due > to > ! dropped tablespaces, lock timeouts, old snapshots, pinned buffers > and > ! deadlocks. > > A clarification that you're talking about standby query cancellation here > might be helpful too. I don't think that's necessary for all of the > detailed pg_stat_get_* functions that regular users are less likely to care > about, just these higher level ones. Yeah, those both make sense - I've updated the docs and am running tests ATM. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/