On Tue, Nov 23, 2021 at 06:54:03PM +0000, Jacob Champion wrote:
> On Wed, 2021-11-17 at 14:34 -0600, Justin Pryzby wrote:
> > On Wed, Nov 17, 2021 at 02:44:52PM -0500, Jaime Casanova wrote:
> > > 
> > > - why we read()/write() at all? is not a faster way of copying the file?
> > >   i'm asking that because i don't actually know.
> > 
> > No portable way.  Linux has this:
> >
> > https://man7.org/linux/man-pages/man2/copy_file_range.2.html
> > 
> > But I just read:
> > 
> > >       First support for cross-filesystem copies was introduced in Linux
> > >       5.3.  Older kernels will return -EXDEV when cross-filesystem
> > >       copies are attempted.
> > 
> > To me that sounds like it may not be worth it, at least not quite yet.
> > But it would be good to test.

I realized that pg_upgrade doesn't copy between filesystems - it copies from
$tablespace/PG13/NNN to $tablespace/PG14/NNN.  So that's no issue.

And I did a bit of testing with this last weekend, and saw no performance
benefit from a larger buffer size, nor from copy_file_range, nor from libc stdio
(fopen/fread/fwrite/fclose).
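For anyone who wants to repeat that kind of comparison, here is a minimal
sketch of the two copy strategies.  It is not the test code I ran and not
pg_upgrade's implementation; the function name copy_file, the try_cfr flag and
the 1MB chunk size are made up for illustration, and it assumes Python 3.8+
(os.copy_file_range is only available on Linux, hence the hasattr check).

```python
import os

CHUNK = 1024 * 1024  # arbitrary chunk size, an assumption for illustration


def copy_file(src, dst, try_cfr=True):
    """Copy src to dst; return the name of the method that finished the copy."""
    fd_in = os.open(src, os.O_RDONLY)
    try:
        fd_out = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            if try_cfr and hasattr(os, "copy_file_range"):
                try:
                    # Without explicit offsets the kernel advances both file
                    # positions, so a fallback resumes where this stopped.
                    while os.copy_file_range(fd_in, fd_out, CHUNK) > 0:
                        pass
                    return "copy_file_range"
                except OSError:
                    # e.g. EXDEV on older kernels, or ENOSYS.  Real I/O errors
                    # will show up again in the read()/write() loop below.
                    pass
            while True:
                buf = os.read(fd_in, CHUNK)
                if not buf:
                    break
                view = memoryview(buf)
                while view:
                    # write() may be partial; keep going until buf is flushed
                    view = view[os.write(fd_out, view):]
            return "read/write"
        finally:
            os.close(fd_out)
    finally:
        os.close(fd_in)
```

Timing copy_file(src, dst) against copy_file(src, dst, try_cfr=False) on a
large file is enough to see whether the syscall makes any difference on a
given filesystem.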
> I think a downside of copy_file_range() is that filesystems might
> perform a reflink under us, and to me that seems like something that
> needs to be opted into via clone mode.
You're referring to this:
|       copy_file_range()  gives  filesystems an opportunity to implement "copy
|    acceleration" techniques, such as the use of reflinks (i.e., two or more
|    i-nodes that share pointers to the same copy-on-write disk blocks) or
|    server-side-copy (in the case of NFS).

I don't see why that's an issue, though.  It's COW, not a hardlink.  It'd be the
same as if the filesystem implemented deduplication, right?  Postgres shouldn't
notice or care.

I guess you're concerned for someone who wants to be able to run pg_upgrade and
preserve the ability to start the old cluster in addition to the new.  But
that'd work fine on a COW filesystem, right?
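To illustrate the distinction I mean, a small sketch (the function and file
names are hypothetical).  A reflink can't be requested portably from here, so
an ordinary copy stands in for it; the assumption is that, from the
application's point of view, a reflinked copy behaves like any other
independent copy, while a hardlink is the same file under a second name.

```python
import os
import shutil


def hardlink_vs_copy(workdir):
    """Return the old file's contents after writing to a copy, then to a hardlink."""
    old = os.path.join(workdir, "old_relfile")
    with open(old, "wb") as f:
        f.write(b"old cluster data")

    linked = os.path.join(workdir, "new_relfile_linked")
    copied = os.path.join(workdir, "new_relfile_copied")
    os.link(old, linked)          # second name for the same inode
    shutil.copyfile(old, copied)  # independent file (stand-in for a reflink)

    # The "new cluster" overwrites the start of its copy: the old file is untouched.
    with open(copied, "r+b") as f:
        f.write(b"NEW")
    with open(old, "rb") as f:
        after_copy_write = f.read()

    # The same write through the hardlink changes the old file as well.
    with open(linked, "r+b") as f:
        f.write(b"NEW")
    with open(old, "rb") as f:
        after_link_write = f.read()

    return after_copy_write, after_link_write
```

The first value returned is still the original contents, and the second one
is not, which is why only the hardlink case would stop the old cluster from
being started again.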
> (https://lwn.net/Articles/846403/ is also good reading on some sharp
> edges, though I doubt many of them apply to our use case.)

Yeah, it doesn't seem the issues are relevant, other than to indicate that the
syscall is still evolving, which supports my initial conclusion that it may not
be worth it yet.

-- 
Justin