Re: Flush pgstats file during checkpoints - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: Flush pgstats file during checkpoints
Date
Msg-id Zod8D_w-ia9PSrCX@paquier.xyz
In response to Re: Flush pgstats file during checkpoints  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
On Sat, Jun 29, 2024 at 11:13:04PM +0200, Tomas Vondra wrote:
> I think those are two independent issues - knowing that the snapshot is
> from the last checkpoint, and knowing that it's correct (not corrupted).
> And yeah, we should be careful about fsync/durable_rename.

Yeah, that's bugging me as well.  I don't really get why we would not
want durability at shutdown for this data.  So how about switching the
end of pgstat_write_statsfile() to use durable_rename()?  Sounds like
an independent change, worth doing on its own.
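
To make that concrete, here is a rough sketch of what the tail of
pgstat_write_statsfile() could look like with durable_rename() instead
of a plain rename() (variable names and error handling simplified, not
an actual patch):

    /*
     * Sketch: durable_rename() fsyncs the renamed file and its parent
     * directory, so the stats file written at shutdown actually reaches
     * disk before we move on.
     */
    if (durable_rename(tmpfile, statfile, LOG) != 0)
    {
        /* durable_rename() has already logged the failure at LOG level */
        unlink(tmpfile);
    }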

> Yeah, I was wondering about the same thing - can't this mean we fail to
> start autovacuum? Let's say we delete a significant fraction of a huge
> table, but then we crash before the next checkpoint. With this patch we
> restore the last stats snapshot, which can easily have
>
> n_dead_tup          | 0
> n_mod_since_analyze | 0
>
> for the table. And thus we fail to notice the table needs autovacuum.
> AFAIK we run autovacuum on all tables with missing stats (or am I
> wrong?). That's what's happening on replicas after switchover/failover
> too, right?

It's the opposite; please look at relation_needs_vacanalyze().  If a
table does not have any stats found in pgstats, like on HEAD after a
crash, then autoanalyze is skipped and autovacuum happens only in the
anti-wraparound case.
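
For reference, the check goes roughly like this (paraphrased from
relation_needs_vacanalyze() in autovacuum.c, simplified and not
verbatim):

    if (PointerIsValid(tabentry) && AutoVacuumingActive())
    {
        /* stats exist: compare the tuple counters to the thresholds */
        *dovacuum = force_vacuum || (vactuples > vacthresh);
        *doanalyze = (anltuples > anlthresh);
    }
    else
    {
        /* no stats entry: anti-wraparound vacuum only, and no analyze */
        *dovacuum = force_vacuum;
        *doanalyze = false;
    }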

For the database case, rebuild_database_list() uses
pgstat_fetch_stat_dbentry() three times and entirely discards databases
that have no stats.  Again, there should be a stats entry post-crash
once a connection has been made to the database.
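
The pattern in rebuild_database_list() is roughly this (simplified):

    /* only consider this database if it has a pgstat entry */
    entry = pgstat_fetch_stat_dbentry(avdb->adw_datid);
    if (entry == NULL)
        continue;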

So there's an argument for saying that the situation does not get worse
here and that we may actually improve the odds by keeping a trace of
the previous stats, no?  In most cases there would be a stats entry
post-crash, as an INSERT or a scan would have re-created it, but the
stats would only reflect activity since the last crash recovery (even
on HEAD, a bulk delete bumping the dead-tuple counter would not be
detected post-crash).  Spiky workloads may of course impact the
decision-making, but at least we'd allow autovacuum to make a decision
based on some previous state of the stats rather than giving up
entirely.  That sounds like a win to me for steady workloads, less so
for bursty workloads.

> It'd not be such an issue if we updated stats during recovery, but I
> don't think we're doing that.  Perhaps we should, which might also
> help on replicas - no idea if it's feasible, though.

Stats on replicas are considered an independent thing AFAIU (scans are
counted in read-only queries, for example).  If we were to update stats
during recovery, we may want to split the stats handling between nodes
in standby state and crash recovery.  Not sure if that's worth the
complication.  For one thing, the stats exist at node level.
--
Michael

