Re: Higher level questions around shared memory stats - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Higher level questions around shared memory stats
Date
Msg-id 20220329210138.vzaxdzij2mi265pu@alap3.anarazel.de
Whole thread Raw
In response to Re: Higher level questions around shared memory stats  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Higher level questions around shared memory stats
Re: Higher level questions around shared memory stats
List pgsql-hackers
Hi,

On 2022-03-29 16:24:02 -0400, Robert Haas wrote:
> On Tue, Mar 29, 2022 at 3:17 PM Andres Freund <andres@anarazel.de> wrote:
> > 1) How to react to corrupted statsfiles?
> >
> > IMO it'd make more sense to just throw away all stats
> > in that case.
>
> That seems reasonable to me. I think there's some downside, in that
> stats are important, and having some of them might be better than
> having none of them. On the other hand, stats file corruption should
> be very rare, and if it's not, we need to fix that.

I think it's reasonably rare because in cases there'd be corruption, we'd
typically not even have written them out / throw them away explicitly - we
only read stats when starting without crash recovery.

So the "expected" case of corruption afaicts solely is a OS crash just after
the shutdown checkpoint completed?


> I think what's actually most important here is the error reporting. We need
> to make it clear, at least via log messages, that something bad has
> happened.

The message currently (on HEAD, but similarly on the path) is:
                ereport(pgStatRunningInCollector ? LOG : WARNING,
                        (errmsg("corrupted statistics file \"%s\"",
                                statfile)));


> And maybe we should have, inside the stats system, something that
> keeps track of when the stats file was last recreated from scratch because
> of a corruption event, separately from when it was last intentionally reset.

That would be easy to add. Don't think we have a good place to show the
information right now - perhaps just new functions not part of any view?

I can think of these different times:

- Last time stats were removed due to starting up in crash recovery
- Last time stats were created from scratch, because no stats data file was
  present at startup
- Last time stats were thrown away due to corruption
- Last time a subset of stats were reset using one of the pg_reset* functions

Makes sense?


> > 2) What do we want to do with stats when starting up in recovery?
> >
> > Today we throw out stats whenever crash recovery is needed. Which includes
> > starting up a replica DB_SHUTDOWNED_IN_RECOVERY.
> >
> > The shared memory stats patch loads the stats in the startup process (whether
> > there's recovery or not). It's obviously no problem to just mirror the current
> > behaviour, we just need to decide what we want?
> >
> > The reason we throw stats away before recovery seem to originate in concerns
> > around old stats being confused for new stats. Of course, "now" that we have
> > streaming replication, throwing away stats *before* recovery, doesn't provide
> > any sort of protection against that. In fact, currently there's no automatic
> > mechanism whatsoever to get rid of stats for dropped objects on a standby.
>
> Does redo update the stats?

With "update" do you mean generate new stats? In the shared memory stats patch
it triggers stats to be dropped, on HEAD it just resets all stats at startup.

Redo itself doesn't generate stats, but bgwriter, checkpointer, backends do.


Thanks,

Andres



pgsql-hackers by date:

Previous
From: Justin Pryzby
Date:
Subject: Re: Commitfest Update
Next
From: Robert Haas
Date:
Subject: Re: Commitfest Update