Re: Large files for relations - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: Large files for relations
Date
Msg-id ZFrAy0zAeeEz8yRD@tamriel.snowman.net
In response to Re: Large files for relations  (Corey Huinker <corey.huinker@gmail.com>)
List pgsql-hackers
Greetings,

* Corey Huinker (corey.huinker@gmail.com) wrote:
> On Wed, May 3, 2023 at 1:37 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Wed, May 3, 2023 at 5:21 PM Thomas Munro <thomas.munro@gmail.com>
> > wrote:
> > > rsync --link-dest

... rsync isn't really a safe tool to use for PG backups by itself
unless you're using it with WAL archiving, start/stop backup, and
checksums enabled.

> > I wonder if rsync will grow a mode that can use copy_file_range() to
> > share blocks with a reference file (= previous backup).  Something
> > like --copy-range-dest.  That'd work for large-file relations
> > (assuming a file system that has block sharing, like XFS and ZFS).
> > You wouldn't get the "mtime is enough, I don't even need to read the
> > bytes" optimisation, which I assume makes all database hackers feel a
> > bit queasy anyway, but you'd get the space savings via the usual
> > rolling checksum or a cheaper version that only looks for strong
> > checksum matches at the same offset, or whatever other tricks rsync
> > might have up its sleeve.
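
Roughly, I'd imagine such a mode boiling down to something like this
(a hypothetical sketch, not anything rsync actually has; the 8kB
block size, function name, and byte-for-byte comparison here stand in
for whatever checksum scheme rsync would really use):

#define _GNU_SOURCE
#include <string.h>
#include <unistd.h>

#define BLOCKSZ 8192

/*
 * Copy one block from srcfd to dstfd, but if the reference file
 * (previous backup) has identical bytes at the same offset, share
 * them via copy_file_range() so a block-sharing filesystem (XFS,
 * ZFS) can avoid storing a second copy.  Returns 0 on success.
 */
static int
copy_or_share_block(int srcfd, int reffd, int dstfd, off_t off)
{
    char    srcbuf[BLOCKSZ];
    char    refbuf[BLOCKSZ];
    ssize_t n = pread(srcfd, srcbuf, BLOCKSZ, off);
    ssize_t r = pread(reffd, refbuf, BLOCKSZ, off);

    if (n <= 0)
        return (int) n;         /* EOF or error */

    if (n == r && memcmp(srcbuf, refbuf, n) == 0)
    {
        /* Same bytes at the same offset: share the reference's extent. */
        off_t in = off, out = off;

        return copy_file_range(reffd, &in, dstfd, &out, n, 0) == n ? 0 : -1;
    }

    /* Differs (or reference is shorter): write the new bytes. */
    return pwrite(dstfd, srcbuf, n, off) == n ? 0 : -1;
}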

There are also really good reasons to have multiple full backups rather
than a single full backup and then lots and lots of incrementals, which
basically boils down to "are you really sure that one copy of that one
really important file won't ever disappear from your backup
repository..?"

That said, pgbackrest does now have block-level incremental backups
(where we define our own block size ...) and there were reasons we
decided against going down the LSN-based approach (not the least of
which is that the LSN isn't always updated...).  Long story short,
moving to larger-than-1G files should be something that pgbackrest can
handle with less impact on incremental backups than there would have
been previously.  There is a loss in the ability to use mtime to scan
just the parts of the relation that changed, and that's unfortunate,
but I wouldn't see it as really a game changer.  (And yes, there's
certainly an argument for not trusting mtime, though I don't think
we've yet had a report of an mtime issue that slipped past our
mtime-validity checking, which automatically forces pgbackrest into
checksum-based revalidation, and resulted in an invalid backup... of
course, not enough people test their backups...)
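
To give a flavor of what that mtime-validity checking amounts to, a
simplified sketch (not pgbackrest's actual code; the names and the
exact sanity checks are illustrative):

#include <stdbool.h>
#include <sys/types.h>
#include <time.h>

/* What the prior backup's manifest recorded for a file. */
typedef struct ManifestEntry
{
    time_t  mtime;
    off_t   size;
} ManifestEntry;

/*
 * Trust "unchanged mtime and size => skip re-copying" only when the
 * timestamps look sane; anything suspicious (e.g. an mtime at or
 * after the prior backup started) forces checksum revalidation.
 */
static bool
can_skip_by_mtime(const ManifestEntry *prior,
                  time_t cur_mtime, off_t cur_size,
                  time_t prior_backup_start)
{
    if (cur_mtime >= prior_backup_start ||
        prior->mtime >= prior_backup_start)
        return false;           /* revalidate via checksums instead */

    return cur_mtime == prior->mtime && cur_size == prior->size;
}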

> I understand the need to reduce open file handles, despite the
> possibilities enabled by using large numbers of small files.

I'm also generally in favor of reducing the number of open file handles
that we have to deal with.  Addressing the concerns raised nearby about
weird corner cases, such as a non-1G-length ABCDEF.1 file existing
while ABCDEF.2 and later segment files also exist, is certainly another
good argument in favor of getting rid of segments.
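
For reference, a simplified view of how a block number maps onto
segment files today (the arithmetic md.c does, with default build
settings; error handling and file extension elided):

#include <sys/types.h>

#define BLCKSZ      8192                    /* default page size */
#define RELSEG_SIZE (0x40000000 / BLCKSZ)   /* 131072 blocks per 1GB segment */

/* Which ABCDEF.N segment file a block lives in, and where within it. */
static void
block_location(unsigned int blkno, unsigned int *segno, off_t *offset)
{
    *segno = blkno / RELSEG_SIZE;
    *offset = (off_t) (blkno % RELSEG_SIZE) * BLCKSZ;
}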

> I am curious whether a move like this to create a generational change in
> file format shouldn't be more ambitious, perhaps altering the block
> format to insert a block format version number, whether that be at every
> block, or every megabyte, or some other interval, and whether we store it
> in-file or in a separate file to accompany the first non-segmented file.
> Having such versioning information would allow blocks of different
> formats to co-exist in the same table, which could be critical to future
> changes such as 64-bit XIDs, etc.

To the extent you're interested in this, there are patches posted which
are already trying to move us in a direction that would allow for
different page formats, adding space for other features such as 64-bit
XIDs, better checksums, and TDE tags.

https://commitfest.postgresql.org/43/3986/

Currently those patches expect the set of page features to be declared
at initdb time, but the way they're currently written that's more of a
soft requirement, as you can tell on a per-page basis which features
are enabled for that page.  Might make sense to support it in that form
first anyway though, before going down the more ambitious route of
allowing different pages to have different sets of features enabled for
them concurrently.
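
The general shape is a small feature bitmask carried on the page
itself, along the lines of the following (an illustrative sketch only;
the flag names and layout are invented here, not the patchset's actual
definitions):

#include <stdint.h>
#include <string.h>

/* Hypothetical feature bits -- not the patchset's actual names. */
#define PF_EXTENDED_CHECKSUM    (1 << 0)
#define PF_TDE_TAG              (1 << 1)
#define PF_XID64_EPOCH          (1 << 2)

/*
 * Fetch the per-page feature word, assumed here to live in a fixed
 * spot at the very end of the page's special space.
 */
static uint16_t
page_feature_flags(const char *page, size_t pagesz)
{
    uint16_t    flags;

    memcpy(&flags, page + pagesz - sizeof(flags), sizeof(flags));
    return flags;
}

A reader can then work out, per page, how much space the enabled
features consume, which is what makes the initdb-time declaration a
soft requirement.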

When it comes to 'a separate file', we do have forks already, and those
serve a very valuable but distinct use-case where you can get
information from the much smaller fork (be it the FSM or the VM or some
future thing), while something like 64-bit XIDs or a stronger checksum
is something you'd really need on every page.  I have serious doubts
about a proposal where we'd store information needed on every page read
in some far-away block that's still in the same file, such as one
metadata block every 1MB, as that would turn every block access into
two.

Thanks,

Stephen
