Hi,

On 2021-04-21 16:28:26 -0400, Stephen Frost wrote:
> * Andres Freund (andres@anarazel.de) wrote:
> > On 2021-04-21 15:51:38 -0400, Stephen Frost wrote:
> > > It does seem like we have some trade-offs here to weigh, but
> > > pg_control is indeed quite small..
> >
> > What do you mean by that? That the overhead of writing it out more
> > frequently wouldn't be too bad? Or that we shouldn't "unnecessarily" add
> > more fields to it?
>
> Mostly just that the added overhead in writing it out more frequently
> wouldn't be too bad.
>
> Seems the missing bit here is "how often, and how do we make that
> happen?" and then we can discuss if there's reason to be concerned that
> it would be 'too frequent' or cause too much additional overhead in
> terms of IO/fsyncs.

The number of writes and the number of fsyncs of the control file
wouldn't necessarily have to be the same. We could e.g. update the file
once per segment, but only fsync it at a lower cadence. We already rely
on handling writes-without-fsync of the control file (which is trivial
due to the <= 512 byte limit).
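
To make the write-vs-fsync split a bit more concrete, here is a rough
sketch, not PostgreSQL code. All names (SketchControlUpdater,
sketch_control_update(), SKETCH_FSYNC_EVERY, the path passed in) are made
up for illustration, and the cadence of 8 is an arbitrary assumption. It
rewrites a single 512 byte block on every "segment switch" but only
fsyncs every Nth write:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>

#define SKETCH_CONTROL_SIZE 512 /* one sector, assumed to reach disk in one shot */
#define SKETCH_FSYNC_EVERY  8   /* hypothetical cadence, not an existing setting */

typedef struct SketchControlUpdater
{
    int      fd;                 /* open descriptor of the control file */
    unsigned writes_since_fsync; /* writes issued since the last fsync */
    unsigned total_writes;
    unsigned total_fsyncs;
} SketchControlUpdater;

/*
 * Overwrite the control block at offset 0. The write happens on every
 * call, the fsync only on every SKETCH_FSYNC_EVERY-th call.
 * Returns false on I/O error.
 */
static bool
sketch_control_update(SketchControlUpdater *u, const void *data, size_t len)
{
    char buf[SKETCH_CONTROL_SIZE];

    assert(len <= SKETCH_CONTROL_SIZE);
    memset(buf, 0, sizeof(buf));
    memcpy(buf, data, len);

    if (pwrite(u->fd, buf, sizeof(buf), 0) != (ssize_t) sizeof(buf))
        return false;
    u->total_writes++;

    if (++u->writes_since_fsync >= SKETCH_FSYNC_EVERY)
    {
        if (fsync(u->fd) != 0)
            return false;
        u->writes_since_fsync = 0;
        u->total_fsyncs++;
    }
    return true;
}

/*
 * Simulate nsegments segment switches against the file at path.
 * Returns the number of fsyncs issued, or -1 on error.
 */
static int
sketch_run_segments(const char *path, unsigned nsegments)
{
    SketchControlUpdater u = {0};

    u.fd = open(path, O_RDWR | O_CREAT, 0600);
    if (u.fd < 0)
        return -1;

    for (unsigned long seg = 1; seg <= nsegments; seg++)
    {
        if (!sketch_control_update(&u, &seg, sizeof(seg)))
        {
            close(u.fd);
            return -1;
        }
    }
    close(u.fd);
    return (int) u.total_fsyncs;
}
```

A count-based cadence is just the simplest thing to show; a time-based
one would look much the same. Either way a crash can lose the un-fsynced
updates, which is the writes-without-fsync case mentioned above.
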

Another interesting question is where we'd do the update from. It seems
like it ought to be some background process:

I can see doing it in the checkpointer - but there are a few phases that
can take a while (e.g. sync) where we currently don't call something like
CheckpointWriteDelay() on a regular basis.

I also can see doing it in bgwriter - none of the work it does should
take all that long, and minor increases in latency ought not to have
much of an impact.

Wal writer seems less suitable, some workloads are sensitive to it not
getting around to doing what it ought to do.

> Adding fields runs the risk of crossing the
> threshold where we feel that we can safely assume all of it will make it
> to disk in one shot and therefore there's more reason to not add extra
> fields to it, if possible.

Yea, we really should stay below 512 bytes (sector size). We're at 296
right now, with 20 bytes lost to padding. If we got close to the limit
we could easily move some of the contents out of pg_control - we
e.g. don't need to write out all the compile time values all the time,
they could live in a file similar to PG_VERSION instead. So I'm not too
concerned right now. But we also don't need to add anything, given that
we already have minRecoveryPoint.
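
For illustration, the size limit and the padding loss can be checked
mechanically. This is a sketch under stated assumptions: the struct below
is a made-up four-field stand-in, not the real control file layout, and
SKETCH_MAX_SAFE_SIZE and sketch_padding_bytes() are hypothetical names.
A compile-time assertion fails the build if the struct outgrows one
sector, and the helper reports how many bytes go to alignment padding:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_SAFE_SIZE 512 /* assumed sector size */

/* Made-up stand-in for the control data; field names are illustrative. */
typedef struct SketchControlData
{
    uint64_t system_identifier;
    uint32_t version;
    uint8_t  state;              /* the compiler typically pads after this */
    uint64_t min_recovery_point;
} SketchControlData;

/* Refuse to compile once the struct no longer fits in a single sector. */
static_assert(sizeof(SketchControlData) <= SKETCH_MAX_SAFE_SIZE,
              "control data must fit in a single sector");

/* Bytes of the struct taken up by alignment padding rather than fields. */
static size_t
sketch_padding_bytes(void)
{
    size_t payload = sizeof(uint64_t) + sizeof(uint32_t) +
                     sizeof(uint8_t) + sizeof(uint64_t);

    return sizeof(SketchControlData) - payload;
}
```

The exact padding depends on the platform ABI, so the helper computes it
from sizeof rather than hardcoding a number.
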

Greetings,

Andres Freund