Re: Prolonged truncation phase during vacuum on toast table with repeated interruptions by lock waiters and a proposed POC patch - Mailing list pgsql-hackers
From
Shayon Mukherjee
Subject
Re: Prolonged truncation phase during vacuum on toast table with repeated interruptions by lock waiters and a proposed POC patch
> actually determine you are in this situation, or how bad the situation > was, in order to know that setting truncate off would help? To that
That's definitely a sound idea and the way I was able to replicate and learn myself was by doing a PATCH that would just print the # of attempts it was doing. That said, just to add more color - I was coming at this from a point of view where truncation would be on a "best effort" basis and the trade off here would be less availability risk at the cost of missed truncations. My first instinct was that, having vacuum_truncate set to ON by default on busy systems that have a lot of churn can catch a lot of developers and PostgreSQL by surprise (sure did to me :D), so the best effort and "fail fast" behavior could help, however at the same time I agree that it doesn't solve either of the problems really well. Another idea I had around this section was to expose an attribute/guc/setting like VACUUM_TRUNCATE_INTERRUPTION_MAX_RETRIES with a default of `1` or something, that users could optionally configure, perhaps that could be a middleground, but I am myself not fully sold on the idea either (?).
Adding counters for this area is not a bad idea in general, as this hits customers particularly hard on hot standbys when the truncate does actually occur on the primary.
What about adding cumulative counters ( per table and pg_stat_database ) such as:
pages_vac_truncated - # of pages truncated by vacuum vac_truncate_conflicts - # of time truncate was skipped due to conflict vac_truncate_suspended - # of times the truncate was suspended.
The difference between conflict and suspended is conflict causes vacuum to skip the truncate phase while suspended causes vacuum to retry the phase continually.
Would vac_truncate_retries or something similar be more immediately clear? If so, maybe something like
- vac_truncate_interruptions - # of time truncate was skipped due to conflict and not retried
- vac_truncate_retries - # of times the truncate was suspended and retried
The argument against adding these counters is that the views, pg_stat_all_tables/ pg_stat_all_database are becoming super-wide, so maybe we need to think about inventing a new view for vacuum related counter metrics. This seems like a good discussion for v19.
+1 for this. I'd even be happy to even just start showing the attempts as part of INFO or VERBOSE on conflicts & suspensions, without keeping track, but it is useful statistical data.