Re: Large files for relations - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: Large files for relations |
Date | |
Msg-id | CA+hUKGLYqFUzAk1LQfD=gbQnOoU31qAKN+G3HLjnCENZFGBUOg@mail.gmail.com Whole thread Raw |
In response to | Re: Large files for relations (Stephen Frost <sfrost@snowman.net>) |
Responses |
Re: Large files for relations
Re: Large files for relations Re: Large files for relations |
List | pgsql-hackers |
On Thu, May 25, 2023 at 1:08 PM Stephen Frost <sfrost@snowman.net> wrote: > * Peter Eisentraut (peter.eisentraut@enterprisedb.com) wrote: > > On 24.05.23 02:34, Thomas Munro wrote: > > > * pg_upgrade would convert if source and target don't match > > > > This would be good, but it could also be an optional or later feature. > > Agreed. OK. I do have a patch for that, but I'll put that (+ copy_file_range) aside for now so we can talk about the basic feature. Without that, pg_upgrade just rejects mismatching clusters as it always did, no change required. > > > I would probably also leave out those Windows file API changes, too. > > > --rel-segsize would simply refuse larger sizes until someone does the > > > work on that platform, to keep the initial proposal small. > > > > Those changes from off_t to pgoff_t? Yes, it would be good to do without > > those. Apart of the practical problems that have been brought up, this was > > a major annoyance with the proposed patch set IMO. +1, it was not nice. Alright, since I had some time to kill in an airport, here is a starter patch for initdb --rel-segsize. Some random thoughts: Another potential option name would be --segsize, if we think we're going to use this for temp files too eventually. Maybe it's not so beautiful to have that global variable rel_segment_size (which replaces REL_SEGSIZE everywhere). Another idea would be to make it static in md.c and call smgrsetsegmentsize(), or something like that. That could be a nice place to compute the "shift" value up front, instead of computing it each time in blockno_to_segno(), but that's probably not worth bothering with (?). BSR/LZCNT/CLZ instructions are pretty fast on modern chips. That's about the only place where someone could say that this change makes things worse for people not interested in the new feature, so I was careful to get rid of / and % operations with no-longer-constant RHS. I had to promote segment size to int64 (global variable, field in control file), because otherwise it couldn't represent --rel-segsize=32TB (it'd be too big by one). Other ideas would be to store the shift value instead of the size, or store the max block number, eg subtract one, or use InvalidBlockNumber to mean "no limit" (with more branches to test for it). The only problem I ran into with the larger type was that 'SHOW segment_size' now needs a custom show function because we don't have int64 GUCs. A C type confusion problem that I noticed: some code uses BlockNumber and some code uses int for segment numbers. It's not really a reachable problem for practical reasons (you'd need over 2 billion directories and VFDs to reach it), but it's wrong to use int if segment size can be set as low as BLCKSZ (one file per block); you could have more segments than an int can represent. We could go for uint32, BlockNumber or create SegmentNumber (which I think I've proposed before, and lost track of...). We can address that separately (perhaps by finding my old patch...)
Attachment
pgsql-hackers by date: