Re: Large files for relations - Mailing list pgsql-hackers
From: Stephen Frost
Subject: Re: Large files for relations
Date:
Msg-id: ZFrAy0zAeeEz8yRD@tamriel.snowman.net
In response to: Re: Large files for relations (Corey Huinker <corey.huinker@gmail.com>)
List: pgsql-hackers
Greetings,

* Corey Huinker (corey.huinker@gmail.com) wrote:
> On Wed, May 3, 2023 at 1:37 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Wed, May 3, 2023 at 5:21 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > > rsync --link-dest ...

rsync isn't really a safe tool to use for PG backups by itself unless you're using it with archiving, with start/stop backup, and with checksums enabled.

> > I wonder if rsync will grow a mode that can use copy_file_range() to
> > share blocks with a reference file (= previous backup). Something
> > like --copy-range-dest. That'd work for large-file relations
> > (assuming a file system that has block sharing, like XFS and ZFS).
> > You wouldn't get the "mtime is enough, I don't even need to read the
> > bytes" optimisation, which I assume makes all database hackers feel a
> > bit queasy anyway, but you'd get the space savings via the usual
> > rolling checksum or a cheaper version that only looks for strong
> > checksum matches at the same offset, or whatever other tricks rsync
> > might have up its sleeve.

There are also really good reasons to have multiple full backups, rather than a single full backup and then lots and lots of incrementals, which basically boils down to: are you really sure that one copy of that one really important file won't ever disappear from your backup repository..?

That said, pgbackrest does now have block-level incremental backups (where we define our own block size ...), and there are reasons we decided against going down the LSN-based approach (not the least of which is that the LSN isn't always updated...). Long story short, moving to larger-than-1G files should be something pgbackrest can handle with much less impact on incremental backups than there would have been previously.

There is a loss in the ability to use mtime to scan just the parts of the relation that changed, and that's unfortunate, but I wouldn't see it as really a game changer. (And yes, there's certainly an argument for not trusting mtime, though I don't think we've yet had a report of an mtime issue that our mtime-validity checking failed to catch and force pgbackrest into checksum-based revalidation automatically, i.e. one that resulted in an invalid backup... of course, not enough people test their backups...)
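For what it's worth, the same-offset "strong checksum" mode Thomas describes isn't hard to sketch. Here's a rough, untested illustration (the function name, the 8K block size, and the direct memcmp() standing in for a proper checksum comparison are all mine; this is not how rsync or pgbackrest actually work) of backing up one block while letting copy_file_range() share storage with the previous backup when nothing changed:

#define _GNU_SOURCE
#include <string.h>
#include <unistd.h>

#define BLOCKSZ 8192			/* arbitrary block size for the example */

/*
 * Copy one block from src_fd into out_fd.  If the reference file
 * (ref_fd, the previous backup) holds identical bytes at the same
 * offset, use copy_file_range() so that a block-sharing filesystem
 * (XFS reflink, ZFS, btrfs) can share the range instead of storing
 * a second copy.  Returns 1 on success, 0 at EOF, -1 on error.
 */
static int
backup_block(int src_fd, int ref_fd, int out_fd, off_t offset)
{
	char		srcbuf[BLOCKSZ];
	char		refbuf[BLOCKSZ];
	ssize_t		n;

	n = pread(src_fd, srcbuf, BLOCKSZ, offset);
	if (n <= 0)
		return (int) n;			/* 0 = EOF, -1 = error */

	/*
	 * A real tool would compare strong checksums from a manifest;
	 * re-reading the reference block and memcmp'ing it is the
	 * simplest stand-in for "checksum match at the same offset".
	 */
	if (pread(ref_fd, refbuf, n, offset) == n &&
		memcmp(srcbuf, refbuf, n) == 0)
	{
		off_t		in_off = offset;
		off_t		out_off = offset;

		/* Identical bytes: share the range from the old backup. */
		if (copy_file_range(ref_fd, &in_off, out_fd, &out_off, n, 0) == n)
			return 1;
		/* On error or short copy, fall through and write the data. */
	}

	return pwrite(out_fd, srcbuf, n, offset) == n ? 1 : -1;
}

You'd still be reading every byte, of course, which is exactly what a stored checksum manifest would buy back, but the space savings come for free from the filesystem.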
> I understand the need to reduce open file handles, despite the
> possibilities enabled by using large numbers of small file sizes.

I'm also generally in favor of reducing the number of open file handles that we have to deal with. Addressing the concerns raised nearby about weird corner cases, where a non-1G-length ABCDEF.1 file exists while ABCDEF.2, and later, segment files also exist, is certainly another good argument in favor of getting rid of segments.

> I am curious whether a move like this to create a generational change in
> file format shouldn't be more ambitious, perhaps altering the block
> format to insert a block format version number, whether that be at every
> block, or every megabyte, or some other interval, and whether we store it
> in-file or in a separate file to accompany the first non-segmented. Having
> such versioning information would allow blocks of different formats to
> co-exist in the same table, which could be critical to future changes such
> as 64-bit XIDs, etc.

To the extent you're interested in this, there are patches posted which are already trying to move us in a direction that would allow for different page formats, adding space for other features such as 64-bit XIDs, better checksums, and TDE tags:

https://commitfest.postgresql.org/43/3986/

Currently those patches expect the page format to be declared at initdb time, but as they're currently written that's more of a soft requirement, since you can tell on a per-page basis which features are enabled for that page (a rough sketch of the idea is in the P.S. below). It might make sense to support it in that form first anyway, before going down the more ambitious route of allowing different pages to have different sets of features enabled concurrently.

When it comes to 'a separate file', we do have forks already, and those serve a very valuable but distinct use-case, where you can get information from the much smaller fork (be it the FSM or the VM or some future thing). Something like 64-bit XIDs or a stronger checksum, though, is something you'd really need on every page. I have serious doubts about a proposal where we'd store information needed on every page read in some far-away block that's still in the same file, such as one header block per 1MB, as that would turn every block access into two.

Thanks,

Stephen
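P.S. To make the per-page feature idea a bit more concrete, here's a purely hypothetical sketch; the struct, flag names, and sizes are illustrative only and aren't what the patches at the commitfest link above actually define:

#include <stdint.h>

/* Optional per-page areas a page might declare (illustrative). */
#define PF_XID64_BASE	(1 << 0)	/* full 64-bit XID base/epoch */
#define PF_EXT_CHECKSUM	(1 << 1)	/* wider-than-16-bit checksum */
#define PF_TDE_TAG		(1 << 2)	/* authentication tag for TDE */

typedef struct PageFeatureHeader
{
	uint16_t	pd_checksum;		/* existing 16-bit checksum slot */
	uint16_t	pd_feature_mask;	/* which optional areas are present */
	/* ... remainder of the standard page header ... */
} PageFeatureHeader;

/*
 * How much reserved space the enabled features consume on this page.
 * Because the mask lives on the page itself, a reader can size the
 * reserved area per page, which is what makes mixing page formats
 * within one relation at least thinkable.
 */
static inline uint16_t
page_feature_space(uint16_t mask)
{
	uint16_t	space = 0;

	if (mask & PF_XID64_BASE)
		space += 8;				/* a uint64 XID base */
	if (mask & PF_EXT_CHECKSUM)
		space += 32;			/* e.g. a SHA-256-sized checksum */
	if (mask & PF_TDE_TAG)
		space += 16;			/* e.g. an AES-GCM tag */

	return space;
}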