Thread: RFE: Make statistics robust for unplanned events
Hello PostgreSQL Hackers,
is it possible to preserve the PostgreSQL statistics on a server crash?
Steps to reproduce the behaviour:
1) Observe the statistics counters and take note of their values (see the example query below)
2) Crash the machine, e.g. with sysrq; perhaps kill -9 on postgresql will already suffice
3) After recovery, observe the statistics counters again. Have they been reset to zero (Bad) or are they preserved (Good)?
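
For reference, a query along these lines can be used in steps 1 and 3 to record the counters (a minimal sketch; my_table is only a placeholder name):

    -- Cumulative statistics for one table, as kept by the statistics collector.
    -- These are the counters that are reset to zero after a crash.
    SELECT relname,
           n_live_tup,           -- estimated live tuples
           n_dead_tup,           -- estimated dead tuples (drives autovacuum)
           n_mod_since_analyze,  -- modifications since the last ANALYZE (drives autoanalyze)
           last_autovacuum,
           last_autoanalyze
    FROM pg_stat_user_tables
    WHERE relname = 'my_table';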
Resetting the counters to zero harms execution planning and auto_vacuum
operations. That can cause growth of the database as dead tuples are not
removed at the right time. In the end the database can go offline if
autovacuum never runs.
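
To make the dependency concrete, this is roughly the per-table condition autovacuum checks, written as a query using the default settings (autovacuum_vacuum_threshold = 50, autovacuum_vacuum_scale_factor = 0.2); with n_dead_tup reset to zero after a crash, the condition stays false no matter how many dead tuples the table really contains:

    -- Tables that autovacuum would consider due for VACUUM under default settings.
    SELECT s.relname,
           s.n_dead_tup,
           50 + 0.2 * c.reltuples AS vacuum_trigger_point
    FROM pg_stat_user_tables s
    JOIN pg_class c ON c.oid = s.relid
    WHERE s.n_dead_tup > 50 + 0.2 * c.reltuples;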
As far as I've checked, this would have to be implemented.
My question would be whether there is something that would make this impossible to implement, and if there isn't, I'd like this to be considered a feature request.
Regards
On Tue, Apr 20, 2021 at 5:00 AM Patrik Novotny <panovotn@redhat.com> wrote:
> As far as I've checked, this would have to be implemented.
>
> My question would be whether there is something that would make this
> impossible to implement, and if there isn't, I'd like this to be
> considered a feature request.

I agree with you. Maybe crash safety would require some care in cases
where autovacuum runs very frequently, so that the overhead isn't too
high. But overall, non-crash-safe information that drives autovacuum is
penny wise, pound foolish. I'm sure that it doesn't matter that much
most of the time, but there are probably workloads and use cases where
it causes significant and persistent problems. That's not the right
trade-off IMV.

--
Peter Geoghegan
On Tue, Apr 20, 2021 at 2:00 PM Patrik Novotny <panovotn@redhat.com> wrote:
>
> Hello PostgreSQL Hackers,
>
> is it possible to preserve the PostgreSQL statistics on a server crash?
>
> Steps to reproduce the behaviour:
> 1) Observe the statistics counters, take note
> 2) Crash the machine, e.g. with sysrq; perhaps kill -9 on postgresql will already suffice
> 3) After recovery, observe the statistics counter again. Have they been reset to zero (Bad) or are they preserved (Good).
>
> Resetting the counters to zero harms execution planning and auto_vacuum
> operations. That can cause growth of database as dead tuples are not removed
> at the right time. In the end the database can go offline if autovacuum never runs.

The stats for the planner are stored persistently in pg_stats though,
but autovacuum definitely takes a hit from it, and several other
things can too.

> As far as I've checked, this would have to be implemented.
>
> My question would be whether there is something that would make this
> impossible to implement, and if there isn't, I'd like this to be
> considered a feature request.

I'm pretty sure everybody would *want* this. At least nobody would be
against it. The problem is the potential performance cost of it.

Andres mentioned at least once over in the thread about shared memory
stats collection that being able to have persistent stats could come
out of that one in the future. Whatever is done on the topic should
probably be done based on that work, as it provides a better starting
point and also one that will stay around.

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
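
A rough illustration of the distinction made above, with a placeholder table name: the first query reads the planner statistics, which are stored in the pg_statistic catalog and survive a crash; the second reads the collected runtime counters, which do not.

    -- Planner statistics: persisted in pg_statistic, exposed through the pg_stats view.
    SELECT attname, n_distinct, null_frac
    FROM pg_stats
    WHERE tablename = 'my_table';

    -- Cumulative runtime statistics: kept by the stats collector, lost on a crash.
    SELECT n_dead_tup, n_mod_since_analyze, last_autovacuum
    FROM pg_stat_user_tables
    WHERE relname = 'my_table';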
On 4/21/21 2:38 PM, Magnus Hagander wrote:
> On Tue, Apr 20, 2021 at 2:00 PM Patrik Novotny <panovotn@redhat.com> wrote:
>>
>> Hello PostgreSQL Hackers,
>>
>> is it possible to preserve the PostgreSQL statistics on a server crash?
>>
>> Steps to reproduce the behaviour:
>> 1) Observe the statistics counters, take note
>> 2) Crash the machine, e.g. with sysrq; perhaps kill -9 on postgresql will already suffice
>> 3) After recovery, observe the statistics counter again. Have they been reset to zero (Bad) or are they preserved (Good).
>>
>> Resetting the counters to zero harms execution planning and auto_vacuum
>> operations. That can cause growth of database as dead tuples are not removed
>> at the right time. In the end the database can go offline if autovacuum never runs.
>
> The stats for the planner are stored persistently in pg_stats though,
> but autovacuum definitely takes a hit from it, and several other
> things can too.
>
>> As far as I've checked, this would have to be implemented.
>>

I think the problem with planner stats is that after reset of the
runtime stats we lose info about which tables may need analyze etc. and
then fail to run ANALYZE in time. Which may have negative impact on
performance, of course.

>> My question would be whether there is something that would make
>> this impossible to implement, and if there isn't, I'd like this to
>> be considered a feature request.
>
> I'm pretty sure everybody would *want* this. At least nobody would be
> against it. The problem is the potential performance cost of it.
>
> Andres mentioned at least once over in the thread about shared memory
> stats collection that being able to have persistent stats could come
> out of that one in the future. Whatever is done on the topic should
> probably be done based on that work, as it provides a better starting
> point and also one that will stay around.
>

Right. I think the other question is how often does this happen in
practice - if your instance crashes often enough to make this an issue,
then there are probably bigger issues.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
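
To see what gets lost on the ANALYZE side, under the default autovacuum_analyze_threshold = 50 and autovacuum_analyze_scale_factor = 0.1: the n_mod_since_analyze counter below is what tells autovacuum that a table is due for ANALYZE, and it starts over at zero after a crash. A sketch:

    -- Tables due for auto-ANALYZE under default settings.
    SELECT s.relname,
           s.n_mod_since_analyze,
           50 + 0.1 * c.reltuples AS analyze_trigger_point
    FROM pg_stat_user_tables s
    JOIN pg_class c ON c.oid = s.relid
    WHERE s.n_mod_since_analyze > 50 + 0.1 * c.reltuples;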
On Wed, Apr 21, 2021 at 5:02 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 4/21/21 2:38 PM, Magnus Hagander wrote:
> > On Tue, Apr 20, 2021 at 2:00 PM Patrik Novotny <panovotn@redhat.com> wrote:
> >>
> >> Hello PostgreSQL Hackers,
> >>
> >> is it possible to preserve the PostgreSQL statistics on a server crash?
> >>
> >> Steps to reproduce the behaviour:
> >> 1) Observe the statistics counters, take note
> >> 2) Crash the machine, e.g. with sysrq; perhaps kill -9 on postgresql will already suffice
> >> 3) After recovery, observe the statistics counter again. Have they been reset to zero (Bad) or are they preserved (Good).
> >>
> >> Resetting the counters to zero harms execution planning and auto_vacuum
> >> operations. That can cause growth of database as dead tuples are not removed
> >> at the right time. In the end the database can go offline if autovacuum never runs.
> >
> > The stats for the planner are stored persistently in pg_stats though,
> > but autovacuum definitely takes a hit from it, and several other
> > things can too.
> >
> >> As far as I've checked, this would have to be implemented.
> >>
>
> I think the problem with planner stats is that after reset of the
> runtime stats we lose info about which tables may need analyze etc. and
> then fail to run ANALYZE in time. Which may have negative impact on
> performance, of course.
>
> >> My question would be whether there is something that would make
> >> this impossible to implement, and if there isn't, I'd like this to
> >> be considered a feature request.
> >
> > I'm pretty sure everybody would *want* this. At least nobody would be
> > against it. The problem is the potential performance cost of it.
> >
> > Andres mentioned at least once over in the thread about shared memory
> > stats collection that being able to have persistent stats could come
> > out of that one in the future. Whatever is done on the topic should
> > probably be done based on that work, as it provides a better starting
> > point and also one that will stay around.
> >
>
> Right. I think the other question is how often does this happen in
> practice - if your instance crashes often enough to make this an issue,
> then there are probably bigger issues.

Agreed.

I think the bigger problem there is replication failover, but that's
also a different issue (keeping the statistics from the *standby*
wouldn't help you much there, you'd need to replicate it from the
primary).

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
On Wed, Apr 21, 2021 at 5:05 PM Magnus Hagander <magnus@hagander.net> wrote:
> > Right. I think the other question is how often does this happen in
> > practice - if your instance crashes often enough to make this an issue,
> > then there are probably bigger issues.
>
> Agreed.
>
> I think the bigger problem there is replication failover, but that's
> also a different issue (keeping the statistics from the *standby*
> wouldn't help you much there, you'd need to replicate it from the
> primary).
>
> --
> Magnus Hagander
> Me: https://www.hagander.net/
> Work: https://www.redpill-linpro.com/
The report that I've received regarding this RFE was triggered by issues
experienced with long-term deployments in a large-scale industrial
environment. The point of this RFE is to be protected against those
issues in the future. While this doesn't seem to be a very frequent
occurrence, I wouldn't consider it a corner case that isn't worth
attention.
If there is an expectation for the performance loss to be less of a problem in the future, would it make sense to make this an opt-in feature until then?
On Wed, Apr 21, 2021 at 5:39 AM Magnus Hagander <magnus@hagander.net> wrote:
> I'm pretty sure everybody would *want* this. At least nobody would be
> against it. The problem is the potential performance cost of it.

VACUUM remembers vacrel->new_live_tuples as the pg_class.reltuples for
the heap relation being vacuumed. It also remembers new_rel_pages in
pg_class (see vac_update_relstats()). However, it does not remember
vacrel->new_dead_tuples in pg_class or in any other durable location
(the information gets remembered via a call to pgstat_report_vacuum()
instead).

We already *almost* pay the full cost of durably storing the
information used by autovacuum.c's relation_needs_vacanalyze() to
determine if a VACUUM is required -- we're only missing
new_dead_tuples/tabentry->n_dead_tuples. Why not go one tiny baby step
further to fix this issue?

Admittedly, storing new_dead_tuples durably is not sufficient to allow
ANALYZE to be launched on schedule when there is a hard crash. It is
also insufficient to make sure that insert-driven autovacuums get
launched on schedule. Even still, I'm pretty sure that just making sure
that we store it durably (alongside pg_class.reltuples?) will impose
only a modest additional cost, while fixing Patrik's problem. That
seems likely to be worth it.

--
Peter Geoghegan
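
The asymmetry described above is visible from SQL as well (my_table is a placeholder): the first two columns come from pg_class and survive a crash, the third comes from the stats collector and does not.

    SELECT c.reltuples,   -- set by VACUUM/ANALYZE via vac_update_relstats(), durable
           c.relpages,    -- likewise durable in pg_class
           s.n_dead_tup   -- reported via pgstat_report_vacuum(), reset to 0 after a crash
    FROM pg_class c
    JOIN pg_stat_user_tables s ON s.relid = c.oid
    WHERE c.relname = 'my_table';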
Peter Geoghegan <pg@bowt.ie> writes:
> We already *almost* pay the full cost of durably storing the
> information used by autovacuum.c's relation_needs_vacanalyze() to
> determine if a VACUUM is required -- we're only missing
> new_dead_tuples/tabentry->n_dead_tuples. Why not go one tiny baby step
> further to fix this issue?

Definitely worth thinking about, but I'm a little confused about how
you see this working. Those pg_class fields are updated by vacuum
(or analyze) itself. How could they usefully serve as input to
autovacuum's decisions?

> Admittedly, storing new_dead_tuples durably is not sufficient to allow
> ANALYZE to be launched on schedule when there is a hard crash. It is
> also insufficient to make sure that insert-driven autovacuums get
> launched on schedule.

I'm not that worried about the former case, but the latter seems like
kind of a problem.

			regards, tom lane
Hi,

On 2021-04-21 14:38:44 +0200, Magnus Hagander wrote:
> Andres mentioned at least once over in the thread about shared memory
> stats collection that being able to have persistent stats could come
> out of that one in the future. Whatever is done on the topic should
> probably be done based on that work, as it provides a better starting
> point and also one that will stay around.

Yea. I think the main benefit from the shared memory stats patch that
would make this easier is that it tracks (with one small hole that can
probably be addressed) dropped objects accurately, even across crashes /
replication. Having old stats around runs into the danger of stats for
an old dropped object being combined with stats for a new object.

I don't think making pgstat.c fully durable by continually storing the
data in a table or something like that is an option. For one, the stats
for a replica and primary are independent. For another, the overhead
would be prohibitive.

But after we gain the accurate dropping of stats we can store a stats
snapshot corresponding to certain WAL records (by serializing to
something like pg_stats_%redo_lsn%) without ending up with dropped stats
surviving.

A big question around this is how often we'd want to write out the
stats. Obviously, the more often we do, the higher the overhead. And the
less frequently, the more stats updates might be lost.

Patrik, for your use cases, would losing at most the stats since the
start of the last checkpoint be an issue?

Greetings,

Andres Freund
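
For what it's worth, the redo LSN that such a snapshot file name could be derived from is already visible from SQL; this is only meant to show where the value lives, not how a patch would necessarily obtain it:

    -- Redo pointer and time of the last checkpoint, as recorded in pg_control.
    SELECT redo_lsn, checkpoint_time
    FROM pg_control_checkpoint();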
On Fri, Apr 23, 2021 at 12:41 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-04-21 14:38:44 +0200, Magnus Hagander wrote:
> > Andres mentioned at least once over in the thread about shared memory
> > stats collection that being able to have persistent stats could come
> > out of that one in the future. Whatever is done on the topic should
> > probably be done based on that work, as it provides a better starting
> > point and also one that will stay around.
>
> Yea. I think the main benefit from the shared memory stats patch that
> would make this easier is that it tracks (with one small hole that can
> probably be addressed) dropped objects accurately, even across crashes /
> replication. Having old stats around runs into the danger of stats for
> an old dropped object being combined with stats for a new object.
>
> I don't think making pgstat.c fully durable by continually storing the
> data in a table or something like that is an option. For one, the stats
> for a replica and primary are independent. For another, the overhead
> would be prohibitive.
>
> But after we gain the accurate dropping of stats we can store a stats
> snapshot corresponding to certain WAL records (by serializing to
> something like pg_stats_%redo_lsn%) without ending up with dropped stats
> surviving.
>
> A big question around this is how often we'd want to write out the
> stats. Obviously, the more often we do, the higher the overhead. And the
> less frequently, the more stats updates might be lost.

Yeah, that's what I was thinking as well -- dumping a snapshot at
regular intervals, so that on crash recovery we lose a "controlled
amount" of recent stats instead of losing *everything*.

I think in most situations a fairly long interval is OK -- if you have
tables that take so many hits that you need a really quick reaction
from autovacuum it will probably pick that up quickly enough even
after a reset. And if it's more the long-term tracking that's
important, then a longer interval is probably OK.

But perhaps make it configurable with a timeout and a "reasonable default"?

> Patrik, for your use cases, would losing at most the stats since the
> start of the last checkpoint be an issue?

Unless there's a particular benefit to tying it specifically to the
checkpoint occurring, I'd rather keep it as a separate setting. They
might both come with the same default of course, but I can certainly
envision cases where you want to increase the checkpoint distance
while keeping the stats interval lower, for example. Many people
increase the checkpoint timeout quite a lot...

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
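
For context on that last point, these are existing settings; the example only illustrates why tying the stats flush interval to checkpoints could mean losing half an hour or more of counters on systems tuned this way:

    SHOW checkpoint_timeout;                        -- default is 5min
    ALTER SYSTEM SET checkpoint_timeout = '30min';  -- common tuning on write-heavy systems
    SELECT pg_reload_conf();                        -- apply without a restart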
> Yeah, that's what I was thinking as well -- dumping a snapshot at
> regular intervals, so that on crash recovery we lose a "controlled
> amount" of recent stats instead of losing *everything*.
>
> I think in most situations a fairly long interval is OK -- if you have
> tables that take so many hits that you need a really quick reaction
> from autovacuum it will probably pick that up quickly enough even
> after a reset. And if it's more the long-term tracking that's
> important, then a longer interval is probably OK.
>
> But perhaps make it configurable with a timeout and a "reasonable default"?
>
> > Patrik, for your use cases, would losing at most the stats since the
> > start of the last checkpoint be an issue?
>
> Unless there's a particular benefit to tying it specifically to the
> checkpoint occurring, I'd rather keep it as a separate setting. They
> might both come with the same default of course, but I can certainly
> envision cases where you want to increase the checkpoint distance
> while keeping the stats interval lower, for example. Many people
> increase the checkpoint timeout quite a lot...
From what I understand, I think it depends on the interval of those
checkpoints. If the interval were configurable with the mentioned
reasonable default, then it shouldn't be an issue.

If we were to choose a hard-coded interval of those checkpoints based on
my case, I would have to consult the original reporter, but then it might
not suit others anyway. Therefore, making it configurable makes more
sense to me personally.
On Thu, Apr 22, 2021 at 3:35 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <pg@bowt.ie> writes:
> > We already *almost* pay the full cost of durably storing the
> > information used by autovacuum.c's relation_needs_vacanalyze() to
> > determine if a VACUUM is required -- we're only missing
> > new_dead_tuples/tabentry->n_dead_tuples. Why not go one tiny baby step
> > further to fix this issue?
>
> Definitely worth thinking about, but I'm a little confused about how
> you see this working. Those pg_class fields are updated by vacuum
> (or analyze) itself. How could they usefully serve as input to
> autovacuum's decisions?

Honestly, the details weren't very well thought out. My basic point
still stands, which is that it shouldn't be *that* expensive to make
the relevant information crash-safe, which would protect the system
from certain pathological cases. Maybe it doesn't even have to be
crash-safe in the way that pg_class.reltuples is -- something
approximate might work quite well.

Assuming that there are no dead tuples after a crash seems rather
naive. I seem to recall that certain init scripts I saw years ago used
Immediate Shutdown mode very frequently. Stuff like that is bound to
happen in some installations, and so we should protect users from
hard-to-foresee extreme consequences. Sure, using immediate shutdown
like that isn't recommended practice, but that's no reason to allow a
nasty failure mode.

--
Peter Geoghegan
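
One way to see the mismatch described above after an immediate shutdown (my_table is a placeholder): pg_stat_user_tables.n_dead_tup reads zero after recovery, while a physical scan with the pgstattuple extension still finds the dead tuples.

    CREATE EXTENSION IF NOT EXISTS pgstattuple;

    SELECT s.n_dead_tup,                                -- 0 right after crash recovery
           (pgstattuple('my_table')).dead_tuple_count   -- dead tuples actually on disk
    FROM pg_stat_user_tables s
    WHERE s.relname = 'my_table';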