Thread: RFE: Make statistics robust for unplanned events

RFE: Make statistics robust for unplanned events

From: Patrik Novotny
Hello PostgreSQL Hackers,

is it possible to preserve the PostgreSQL statistics on a server crash?

Steps to reproduce the behaviour:
1) Observe the statistics counters and take note of their values (for example with the query sketched below)
2) Crash the machine, e.g. with sysrq; kill -9 on the postgres processes may already suffice
3) After recovery, observe the statistics counters again. Have they been reset to zero (Bad) or are they preserved (Good)?
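
A minimal sketch of such an observation query (any of the statistics
views backed by the stats collector would do; pg_stat_user_tables is
just one example):

SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
  FROM pg_stat_user_tables
 ORDER BY relname;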

Resetting the counters to zero harms execution planning and autovacuum
operations. That can cause growth of the database as dead tuples are not
removed at the right time. In the end the database can go offline if
autovacuum never runs.

As far as I've checked, this is not currently supported and would have to be implemented.

My question would be whether there is something that would make this impossible to implement, and if there isn't, I'd like this to be considered a feature request.


Regards

--
Patrik Novotný
Associate Software Engineer
Red Hat
panovotn@redhat.com  

Re: RFE: Make statistics robust for unplanned events

From: Peter Geoghegan
On Tue, Apr 20, 2021 at 5:00 AM Patrik Novotny <panovotn@redhat.com> wrote:
> As far as I've checked, this would have to be implemented.
>
> My question would be whether there is something that would make this impossible to implement, and if there isn't, I'd like this to be considered a feature request.

I agree with you.

Maybe crash safety would require some care in cases where autovacuum
runs very frequently, so that the overhead isn't too high. But
overall, non-crash-safe information that drives autovacuum is penny
wise, pound foolish.

I'm sure that it doesn't matter that much most of the time, but there
are probably workloads and use cases where it causes significant and
persistent problems. That's not the right trade-off IMV.

-- 
Peter Geoghegan



Re: RFE: Make statistics robust for unplanned events

From: Magnus Hagander
On Tue, Apr 20, 2021 at 2:00 PM Patrik Novotny <panovotn@redhat.com> wrote:
>
> Hello PostgreSQL Hackers,
>
> is it possible to preserve the PostgreSQL statistics on a server crash?
>
> Steps to reproduce the behaviour:
> 1) Observe the statistics counters, take note
> 2) Crash the machine, e.g. with sysrq; perhaps kill -9 on postgresql will already suffice
> 3) After recovery, observe the statistics counters again. Have they been reset to zero (Bad) or are they preserved (Good)?
>
> Resetting the counters to zero harms execution planning and auto_vacuum
> operations. That can cause growth of database as dead tuples are not removed
> at the right time. In the end the database can go offline if autovacuum never runs.

The stats for the planner are stored persistently in pg_stats though,
but autovacuum definitely takes a hit from it, and several other
things can too.

> As far as I've checked, this would have to be implemented.
>
> My question would be whether there is something that would make this impossible to implement, and if there isn't, I'd like this to be considered a feature request.

I'm pretty sure everybody would *want* this. At least nobody would be
against it. The problem is the potential performance cost of it.

Andres mentioned at least once over in the thread about shared memory
stats collection that being able to have persistent stats could come
out of that one in the future. Whatever is done on the topic should
probably be done based on that work, as it provides a better starting
point and also one that will stay around.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



Re: RFE: Make statistics robust for unplanned events

From: Tomas Vondra

On 4/21/21 2:38 PM, Magnus Hagander wrote:
> On Tue, Apr 20, 2021 at 2:00 PM Patrik Novotny <panovotn@redhat.com> wrote:
>>
>> Hello PostgreSQL Hackers,
>>
>> is it possible to preserve the PostgreSQL statistics on a server crash?
>>
>> Steps to reproduce the behaviour:
>> 1) Observe the statistics counters, take note
>> 2) Crash the machine, e.g. with sysrq; perhaps kill -9 on postgresql will already suffice
>> 3) After recovery, observe the statistics counters again. Have they been reset to zero (Bad) or are they preserved (Good)?
>>
>> Resetting the counters to zero harms execution planning and auto_vacuum
>> operations. That can cause growth of database as dead tuples are not removed
>> at the right time. In the end the database can go offline if autovacuum never runs.
> 
> The stats for the planner are stored persistently in pg_stats though,
> but autovacuum definitely takes a hit from it, and several other
> things can too.
> 
>> As far as I've checked, this would have to be implemented.
>>

I think the problem with planner stats is that after a reset of the
runtime stats we lose info about which tables may need analyze etc. and
then fail to run ANALYZE in time. Which may have a negative impact on
performance, of course.

>> My question would be whether there is something that would make 
>> this impossible to implement, and if there isn't, I'd like this to
>> be considered a feature request.
> 
> I'm pretty sure everybody would *want* this. At least nobody would be
> against it. The problem is the potential performance cost of it.
> 
> Andres mentioned at least once over in the thread about shared memory
> stats collection that being able to have persistent stats could come
> out of that one in the future. Whatever is done on the topic should
> probably be done based on that work, as it provides a better starting
> point and also one that will stay around.
> 

Right. I think the other question is how often does this happen in
practice - if your instance crashes often enough to make this an issue,
then there are probably bigger issues.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: RFE: Make statistics robust for unplanned events

From: Magnus Hagander
On Wed, Apr 21, 2021 at 5:02 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
>
>
> On 4/21/21 2:38 PM, Magnus Hagander wrote:
> > On Tue, Apr 20, 2021 at 2:00 PM Patrik Novotny <panovotn@redhat.com> wrote:
> >>
> >> Hello PostgreSQL Hackers,
> >>
> >> is it possible to preserve the PostgreSQL statistics on a server crash?
> >>
> >> Steps to reproduce the behaviour:
> >> 1) Observe the statistics counters, take note
> >> 2) Crash the machine, e.g. with sysrq; perhaps kill -9 on postgresql will already suffice
> >> 3) After recovery, observe the statistics counters again. Have they been reset to zero (Bad) or are they preserved (Good)?
> >>
> >> Resetting the counters to zero harms execution planning and auto_vacuum
> >> operations. That can cause growth of database as dead tuples are not removed
> >> at the right time. In the end the database can go offline if autovacuum never runs.
> >
> > The stats for the planner are stored persistently in pg_stats though,
> > but autovacuum definitely takes a hit from it, and several other
> > things can too.
> >
> >> As far as I've checked, this would have to be implemented.
> >>
>
> I think the problem with planner stats is that after reset of the
> runtime stats we lose info about which tables may need analyze etc. and
> then fail to run ANALYZE in time. Which may have negative impact on
> performance, of course.
>
> >> My question would be whether there is something that would make
> >> this impossible to implement, and if there isn't, I'd like this to
> >> be considered a feature request.
> >
> > I'm pretty sure everybody would *want* this. At least nobody would be
> > against it. The problem is the potential performance cost of it.
> >
> > Andres mentioned at least once over in the thread about shared memory
> > stats collection that being able to have persistent stats could come
> > out of that one in the future. Whatever is done on the topic should
> > probably be done based on that work, as it provides a better starting
> > point and also one that will stay around.
> >
>
> Right. I think the other question is how often does this happen in
> practice - if your instance crashes often enough to make this an issue,
> then there are probably bigger issues.

Agreed.

I think the bigger problem there is replication failover, but that's
also a different issue (keeping the statistics from the *standby*
wouldn't help you much there, you'd need to replicate it from the
primary).

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



Re: RFE: Make statistics robust for unplanned events

From: Patrik Novotny
On Wed, Apr 21, 2021 at 5:05 PM Magnus Hagander <magnus@hagander.net> wrote:

> > Right. I think the other question is how often does this happen in
> > practice - if your instance crashes often enough to make this an issue,
> > then there are probably bigger issues.
>
> Agreed.
>
> I think the bigger problem there is replication failover, but that's
> also a different issue (keeping the statistics from the *standby*
> wouldn't help you much there, you'd need to replicate it from the
> primary).
>
> --
>  Magnus Hagander
>  Me: https://www.hagander.net/
>  Work: https://www.redpill-linpro.com/


The report that I've received regarding this RFE was triggered by issues
experienced with long-term deployments in a large-scale industrial
environment. The point of this RFE is to be protected against those
issues in the future. While this doesn't seem to be a very frequent
occurrence, I wouldn't consider it a corner case not worth attention.

If there is an expectation for the performance loss to be less of a
problem in the future, would it make sense to make this an opt-in
feature until then?

--
Patrik Novotný
Associate Software Engineer
Red Hat
panovotn@redhat.com  

Re: RFE: Make statistics robust for unplanned events

From: Peter Geoghegan
On Wed, Apr 21, 2021 at 5:39 AM Magnus Hagander <magnus@hagander.net> wrote:
> I'm pretty sure everybody would *want* this. At least nobody would be
> against it. The problem is the potential performance cost of it.

VACUUM remembers vacrel->new_live_tuples as the pg_class.reltuples for
the heap relation being vacuumed. It also remembers new_rel_pages in
pg_class (see vac_update_relstats()). However, it does not remember
vacrel->new_dead_tuples in pg_class or in any other durable location
(the information gets remembered via a call to pgstat_report_vacuum()
instead).

We already *almost* pay the full cost of durably storing the
information used by autovacuum.c's relation_needs_vacanalyze() to
determine if a VACUUM is required -- we're only missing
new_dead_tuples/tabentry->n_dead_tuples. Why not go one tiny baby step
further to fix this issue?
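
To make the dependency concrete, this is roughly the dead-tuple check
that relation_needs_vacanalyze() performs, expressed as a sketch against
the user-visible views and the global settings (it ignores per-table
reloptions); n_dead_tup is the part that only exists in the stats
collector and goes back to zero after a crash:

SELECT s.relname,
       s.n_dead_tup,
       current_setting('autovacuum_vacuum_threshold')::bigint
         + current_setting('autovacuum_vacuum_scale_factor')::float8 * c.reltuples
         AS vacuum_threshold
  FROM pg_stat_user_tables s
  JOIN pg_class c ON c.oid = s.relid
 WHERE s.n_dead_tup >
       current_setting('autovacuum_vacuum_threshold')::bigint
         + current_setting('autovacuum_vacuum_scale_factor')::float8 * c.reltuples;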

Admittedly, storing new_dead_tuples durably is not sufficient to allow
ANALYZE to be launched on schedule when there is a hard crash. It is
also insufficient to make sure that insert-driven autovacuums get
launched on schedule. Even still, I'm pretty sure that just making
sure that we store it durably (alongside pg_class.reltuples?) will
impose only a modest additional cost, while fixing Patrik's problem.
That seems likely to be worth it.

-- 
Peter Geoghegan



Re: RFE: Make statistics robust for unplanned events

From: Tom Lane
Peter Geoghegan <pg@bowt.ie> writes:
> We already *almost* pay the full cost of durably storing the
> information used by autovacuum.c's relation_needs_vacanalyze() to
> determine if a VACUUM is required -- we're only missing
> new_dead_tuples/tabentry->n_dead_tuples. Why not go one tiny baby step
> further to fix this issue?

Definitely worth thinking about, but I'm a little confused about how
you see this working.  Those pg_class fields are updated by vacuum
(or analyze) itself.  How could they usefully serve as input to
autovacuum's decisions?

> Admittedly, storing new_dead_tuples durably is not sufficient to allow
> ANALYZE to be launched on schedule when there is a hard crash. It is
> also insufficient to make sure that insert-driven autovacuums get
> launched on schedule.

I'm not that worried about the former case, but the latter seems
like kind of a problem.

            regards, tom lane



Re: RFE: Make statistics robust for unplanned events

From: Andres Freund
Hi,

On 2021-04-21 14:38:44 +0200, Magnus Hagander wrote:
> Andres mentioned at least once over in the thread about shared memory
> stats collection that being able to have persistent stats could come
> out of that one in the future. Whatever is done on the topic should
> probably be done based on that work, as it provides a better starting
> point and also one that will stay around.

Yea. I think the main benefit from the shared memory stats patch that
would make this easier is that it tracks (with one small hole that can
probably be addressed) dropped objects accurately, even across crashes /
replication. Having old stats around runs the danger of stats for an old
dropped object being combined with stats for a new object.

I don't think making pgstat.c fully durable by continually storing the
data in a table or something like that is an option. For one, the stats
for a replica and primary are independent. For another, the overhead
would be prohibitive.

But after we gain the accurate dropping of stats we can store a stats
snapshot corresponding to certain WAL records (by serializing to
something like pg_stats_%redo_lsn%) without ending up with dropped stats
surviving.

A big question around this is how often we'd want to write out the
stats. Obviously, the more often we do, the higher the overhead. And the
less frequently, the more stats updates might be lost.


Patrik, for your use cases, would losing at most the stats since the
start of the last checkpoint be an issue?

Greetings,

Andres Freund



Re: RFE: Make statistics robust for unplanned events

From: Magnus Hagander
On Fri, Apr 23, 2021 at 12:41 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-04-21 14:38:44 +0200, Magnus Hagander wrote:
> > Andres mentioned at least once over in the thread about shared memory
> > stats collection that being able to have persistent stats could come
> > out of that one in the future. Whatever is done on the topic should
> > probably be done based on that work, as it provides a better starting
> > point and also one that will stay around.
>
> Yea. I think the main benefit from the shared memory stats patch that
> would make this easier is that it tracks (with one small hole that can
> probably be addressed) dropped objects accurately, even across crashes /
> replication. Having old stats around runs the danger of stats for an old
> dropped object being combined with stats for a new object.
>
> I don't think making pgstat.c fully durable by continually storing the
> data in a table or something like that is an option. For one, the stats
> for a replica and primary are independent. For another, the overhead
> would be prohibitive.
>
> But after we gain the accurate dropping of stats we can store a stats
> snapshot corresponding to certain WAL records (by serializing to
> something like pg_stats_%redo_lsn%) without ending up with dropped stats
> surviving.
>
> A big question around this is how often we'd want to write out the
> stats. Obviously, the more often we do, the higher the overhead. And the
> less frequently, the more stats updates might be lost.

Yeah, that's what I was thinking as well -- dumping a snapshot at
regular intervals, so that on crash recovery we lose a "controlled
amount" of recent stats instead of losing *everything*.

I think in most situations a fairly long interval is OK -- if you have
tables that take so many hits that you need a really quick reaction
from autovacuum it will probably pick that up quickly enough even
after a reset. And if it's more the long-term tracking that's
important, then a longer interval is probably OK.

But perhaps make it configurable with a timeout and a "reasonable default"?


> Patrik, for your use cases, would losing at most the stats since the
> start of the last checkpoint be an issue?

Unless there's a particular benefit to tie it specifically to the
checkpoint occurring, I'd rather keep it as a separate setting. They
might both come with the same default of course, but I can certainly
envision cases where you want to increase the checkpoint distance
while keeping the stats interval lower for example. Many people
increase the checkpoint timeout quite a lot...

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



Re: RFE: Make statistics robust for unplanned events

From: Patrik Novotny

> Yeah, that's what I was thinking as well -- dumping a snapshot at
> regular intervals, so that on crash recovery we lose a "controlled
> amount" of recent stats instead of losing *everything*.
>
> I think in most situations a fairly long interval is OK -- if you have
> tables that take so many hits that you need a really quick reaction
> from autovacuum it will probably pick that up quickly enough even
> after a reset. And if it's more the long-term tracking that's
> important, then a longer interval is probably OK.
>
> But perhaps make it configurable with a timeout and a "reasonable default"?
>
> > Patrik, for your use cases, would losing at most the stats since the
> > start of the last checkpoint be an issue?
>
> Unless there's a particular benefit to tie it specifically to the
> checkpoint occurring, I'd rather keep it as a separate setting. They
> might both come with the same default of course, but I can certainly
> envision cases where you want to increase the checkpoint distance
> while keeping the stats interval lower for example. Many people
> increase the checkpoint timeout quite a lot...


From what I understand, I think it depends on the interval of those
checkpoints. If the interval was configurable with the mentioned
reasonable default, then it shouldn't be an issue.

If we were to choose a hard-coded interval of those checkpoints based on
my case, I would have to consult the original reporter, but then it might
not suit others anyway. Therefore, making it configurable makes more
sense to me personally.

--
Patrik Novotný
Associate Software Engineer
Red Hat
panovotn@redhat.com  

Re: RFE: Make statistics robust for unplanned events

From: Peter Geoghegan
On Thu, Apr 22, 2021 at 3:35 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <pg@bowt.ie> writes:
> > We already *almost* pay the full cost of durably storing the
> > information used by autovacuum.c's relation_needs_vacanalyze() to
> > determine if a VACUUM is required -- we're only missing
> > new_dead_tuples/tabentry->n_dead_tuples. Why not go one tiny baby step
> > further to fix this issue?
>
> Definitely worth thinking about, but I'm a little confused about how
> you see this working.  Those pg_class fields are updated by vacuum
> (or analyze) itself.  How could they usefully serve as input to
> autovacuum's decisions?

Honestly, the details weren't very well thought out. My basic point
still stands, which is that it shouldn't be *that* expensive to make
the relevant information crash-safe, which would protect the system
from certain pathological cases. Maybe it doesn't even have to be
crash-safe in the way that pg_class.reltuples is -- something
approximate might work quite well. Assuming that there are no dead
tuples after a crash seems rather naive.

I seem to recall that certain init scripts I saw years ago used
Immediate Shutdown mode very frequently. Stuff like that is bound to
happen in some installations, and so we should protect users from
hard-to-foresee extreme consequences. Sure, using immediate shutdown
like that isn't recommended practice, but that's no reason to allow a
nasty failure mode.

--
Peter Geoghegan