Large files for relations - Mailing list pgsql-hackers

From Thomas Munro
Subject Large files for relations
Date
Msg-id CA+hUKG+BGXwMbrvzXAjL8VMGf25y_ga_XnO741g10y0=m6dDiA@mail.gmail.com
Whole thread Raw
Responses Re: Large files for relations
Re: Large files for relations
List pgsql-hackers
Big PostgreSQL databases use and regularly open/close huge numbers of
file descriptors and directory entries for various anachronistic
reasons, one of which is the 1GB RELSEG_SIZE thing.  The segment
management code is trickier that you might think and also still
harbours known bugs.

A nearby analysis of yet another obscure segment life cycle bug
reminded me of this patch set to switch to simple large files and
eventually drop all that.  I originally meant to develop the attached
sketch-quality code further and try proposing it in the 16 cycle,
while I was down the modernisation rabbit hole[1], but then I got side
tracked: at some point I believed that the 56 bit relfilenode thing
might be necessary for correctness, but then I found a set of rules
that seem to hold up without that.  I figured I might as well post
what I have early in the 17 cycle as a "concept" patch to see which
way the flames blow.

There are various boring details due to Windows, and then a load of
fairly obvious changes, and then a whole can of worms about how we'd
handle the transition for the world's fleet of existing databases.
I'll cut straight to that part.  Different choices on aggressiveness
could be made, but here are the straw-man answers I came up with so
far:

1.  All new relations would be in large format only.  No 16384.N
files, just 16384 that can grow to MaxBlockNumber * BLCKSZ.

2.  The existence of a file 16384.1 means that this smgr relation is
in legacy segmented format that came from pg_upgrade (note that we
don't unlink that file once it exists, even when truncating the fork,
until we eventually drop the relation).

3.  Forks that were pg_upgrade'd from earlier releases using hard
links or reflinks would implicitly be in large format if they only had
one segment, and otherwise they could stay in the traditional format
for a grace period of N major releases, after which we'd plan to drop
segment support.  pg_upgrade's [ref]link mode would therefore be the
only way to get a segmented relation, other than a developer-only
trick for testing/debugging.

4.  Every opportunity to convert a multi-segment fork to large format
would be taken: pg_upgrade in copy mode, basebackup, COPY DATABASE,
VACUUM FULL, TRUNCATE, etc.  You can see approximately working sketch
versions of all the cases I thought of so far in the attached.

5.  The main places that do file-level copying of relations would use
copy_file_range() to do the splicing, so that on file systems that are
smart enough (XFS, ZFS, BTRFS, ...) with qualifying source and
destination, the operation can be very fast, and other degrees of
optimisation are available to the kernel too even for file systems
without block sharing magic (pushing down block range copies to
hardware/network storage, etc).  The copy_file_range() stuff could
also be proposed independently (I vaguely recall it was discussed a
few times before), it's just that it really comes into its own when
you start splicing files together, as needed here, and it's also been
adopted by FreeBSD with the same interface as Linux and has an
efficient implementation in bleeding edge ZFS there.

Stepping back, the main ideas are: (1) for some users of large
databases, it would be painlessly done at upgrade time without even
really noticing, using modern file system facilities where possible
for speed; (2) for anyone who wants to defer that because of lack of
fast copy_file_range() and a desire to avoid prolonged downtime by
using links or reflinks, concatenation can be put off for the next N
releases, giving a total of 5 + N years of option to defer the work,
and in that case there are also many ways to proactively change to
large format before the time comes with varying degrees of granularity
and disruption.  For example, set up a new replica and fail over, or
VACUUM FULL tables one at a time, etc.

There are plenty of things left to do in this patch set: pg_rewind
doesn't understand optional segmentation yet, there are probably more
things like that, and I expect there are some ssize_t vs pgoff_t
confusions I missed that could bite a 32 bit system.  But you can see
the basics working on a typical system.

I am not aware of any modern/non-historic filesystem[2] that can't do
large files with ease.  Anyone know of anything to worry about on that
front?  I think the main collateral damage would be weird old external
tools like some weird old version of Windows tar I occasionally see
mentioned, that sort of thing, but that'd just be another case of
"well don't use that then", I guess?  What else might we need to think
about, outside PostgreSQL?

What other problems might occur inside PostgreSQL?  Clearly we'd need
to figure out a decent strategy to automate testing of all of the
relevant transitions.  We could test the splicing code paths with an
optional test suite that you might enable along with a small segment
size (as we're already testing on CI and probably BF after the last
round of segmentation bugs).  To test the messy Windows off_t API
stuff convincingly, we'd need actual > 4GB files, I think?  Maybe
doable cheaply with file system hole punching tricks.

Speaking of file system holes, this patch set doesn't touch buffile.c
That code wants to use segments for two extra purposes: (1) parallel
create index merges workers' output using segmentation tricks as if
there were holes in the file; this could perhaps be replaced with
large files that make use of actual OS-level holes but I didn't feel
like additionally claiming that all computers have spare files --
perhaps another approach is needed anyway; (2) buffile.c deliberately
spreads large buffiles around across multiple temporary tablespaces
using segments supposedly for space management reasons.  So although
it initially looks like a nice safe little place to start using large
files, we'd need an answer to those design choices first.

/me dons flameproof suit and goes back to working on LLVM problems for a while

[1] https://wiki.postgresql.org/wiki/AllComputers
[2] https://en.wikipedia.org/wiki/Comparison_of_file_systems

Attachment

pgsql-hackers by date:

Previous
From: Tatsuo Ishii
Date:
Subject: Re: Add RESPECT/IGNORE NULLS and FROM FIRST/LAST options
Next
From: Michael Paquier
Date:
Subject: Re: Tab completion for CREATE SCHEMAAUTHORIZATION