Re: Large files for relations - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Large files for relations
Date
Msg-id CA+hUKGLYqFUzAk1LQfD=gbQnOoU31qAKN+G3HLjnCENZFGBUOg@mail.gmail.com
In response to Re: Large files for relations  (Stephen Frost <sfrost@snowman.net>)
List pgsql-hackers
On Thu, May 25, 2023 at 1:08 PM Stephen Frost <sfrost@snowman.net> wrote:
> * Peter Eisentraut (peter.eisentraut@enterprisedb.com) wrote:
> > On 24.05.23 02:34, Thomas Munro wrote:
> > > * pg_upgrade would convert if source and target don't match
> >
> > This would be good, but it could also be an optional or later feature.
>
> Agreed.

OK.  I do have a patch for that, but I'll put that (+ copy_file_range)
aside for now so we can talk about the basic feature.  Without that,
pg_upgrade just rejects mismatching clusters as it always did, no
change required.

> > > I would probably also leave out those Windows file API changes, too.
> > > --rel-segsize would simply refuse larger sizes until someone does the
> > > work on that platform, to keep the initial proposal small.
> >
> > Those changes from off_t to pgoff_t?  Yes, it would be good to do without
> > those.  Apart of the practical problems that have been brought up, this was
> > a major annoyance with the proposed patch set IMO.

+1, it was not nice.

Alright, since I had some time to kill in an airport, here is a
starter patch for initdb --rel-segsize.  Some random thoughts:

Another potential option name would be --segsize, if we think we're
going to use this for temp files too eventually.

Maybe it's not so beautiful to have that global variable
rel_segment_size (which replaces REL_SEGSIZE everywhere).  Another
idea would be to make it static in md.c and call smgrsetsegmentsize(),
or something like that.  That could be a nice place to compute the
"shift" value up front, instead of computing it each time in
blockno_to_segno(), but that's probably not worth bothering with (?).
BSR/LZCNT/CLZ instructions are pretty fast on modern chips.  That's
about the only place where someone could say that this change makes
things worse for people not interested in the new feature, so I was
careful to get rid of / and % operations with no-longer-constant RHS.

I had to promote segment size to int64 (global variable, field in
control file), because otherwise it couldn't represent
--rel-segsize=32TB (it'd be too big by one).  Other ideas would be to
store the shift value instead of the size, or store the max block
number, e.g. subtract one, or use InvalidBlockNumber to mean "no limit"
(with more branches to test for it).  The only problem I ran into with
the larger type was that 'SHOW segment_size' now needs a custom show
function because we don't have int64 GUCs.
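The "too big by one" arithmetic can be checked with a small sketch. It assumes the default 8kB BLCKSZ and a segment size counted in blocks; the helper names are hypothetical and not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define BLCKSZ 8192				/* assumed default block size */

/* Convert a segment size in bytes to a count of blocks. */
static int64_t
segsize_bytes_to_blocks(int64_t bytes)
{
	assert(bytes > 0 && bytes % BLCKSZ == 0);
	return bytes / BLCKSZ;
}

/* Would this value survive being stored in a 32-bit unsigned field? */
static int
fits_in_uint32(int64_t v)
{
	return v >= 0 && v <= (int64_t) UINT32_MAX;
}
```

Under these assumptions 32TB is 2^45 bytes, which is 2^32 blocks, one more than UINT32_MAX. Storing the size minus one (the max block number within a segment) or the shift value 32 would fit in 32 bits, which is what the alternatives listed above amount to.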

A C type confusion problem that I noticed: some code uses BlockNumber
and some code uses int for segment numbers.  It's not really a
reachable problem for practical reasons (you'd need over 2 billion
directories and VFDs to reach it), but it's wrong to use int if
segment size can be set as low as BLCKSZ (one file per block); you
could have more segments than an int can represent.  We could go for
uint32, BlockNumber or create SegmentNumber (which I think I've
proposed before, and lost track of...).  We can address that
separately (perhaps by finding my old patch...)
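As a rough illustration of the int problem, the sketch below assumes a 32-bit int and a hypothetical SegmentNumber typedef (the old patch is not reproduced here). The helper names are made up for this example.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
typedef uint32_t SegmentNumber;	/* hypothetical type, same width as BlockNumber */

#define InvalidBlockNumber	((BlockNumber) 0xFFFFFFFF)
#define MaxBlockNumber		((BlockNumber) 0xFFFFFFFE)

/* Highest segment number a relation could need for a given segment size. */
static SegmentNumber
last_segment_number(int64_t segsize_blocks)
{
	assert(segsize_blocks > 0);
	return (SegmentNumber) ((int64_t) MaxBlockNumber / segsize_blocks);
}

/* Can this segment number be represented in a plain int? */
static int
segno_fits_in_int(SegmentNumber segno)
{
	return segno <= (SegmentNumber) INT_MAX;
}
```

With one block per segment, the last possible segment number equals MaxBlockNumber, which is beyond INT_MAX; with 1GB segments of 8kB blocks it is only 32767.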
