Thread: file cloning in pg_upgrade and CREATE DATABASE
Here is another attempt at implementing file cloning for pg_upgrade and CREATE DATABASE. The idea is to take advantage of file systems that can make copy-on-write clones, which would make the copy run much faster. For pg_upgrade, this will give the performance of --link mode without the associated drawbacks. There have been patches proposed previously [0][1]. The concerns there were mainly that they required a Linux-specific ioctl() call and only worked for Btrfs. Some new things have happened since then: - XFS has (optional) reflink support. This file system is probably more widely used than Btrfs. - Linux and glibc have a proper function to do this now. - APFS on macOS supports file cloning. So altogether this feature will be more widely usable and less ugly to implement. Note, however, that you will currently need literally the latest glibc release, so it probably won't be accessible right now unless you are using Fedora 28 for example. (This is the copy_file_range() function that had us recently rename the same function in pg_rewind.) Some example measurements: 6 GB database, pg_upgrade unpatched 30 seconds, patched 3 seconds (XFS and APFS) similar for a CREATE DATABASE from a large template Even if you don't have a file system with cloning support, the special library calls make copying faster. For example, on APFS, in this example, an unpatched CREATE DATABASE takes 30 seconds, with the library call (but without cloning) it takes 10 seconds. For amusement/bewilderment, without the recent flush optimization on APFS, this takes 2 minutes 30 seconds. I suppose this optimization will now actually obsolete, since macOS will no longer hit that code. [0]: https://www.postgresql.org/message-id/flat/513C0E7C.5080606%40socialserve.com [1]: https://www.postgresql.org/message-id/flat/20140213030731.GE4831%40momjian.us -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Tue, Feb 20, 2018 at 10:00 PM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > Some example measurements: > > 6 GB database, pg_upgrade unpatched 30 seconds, patched 3 seconds (XFS > and APFS) > > similar for a CREATE DATABASE from a large template > > Even if you don't have a file system with cloning support, the special > library calls make copying faster. For example, on APFS, in this > example, an unpatched CREATE DATABASE takes 30 seconds, with the library > call (but without cloning) it takes 10 seconds. Nice results. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 02/21/2018 04:00 AM, Peter Eisentraut wrote: > ... > > Some example measurements: > > 6 GB database, pg_upgrade unpatched 30 seconds, patched 3 seconds (XFS > and APFS) > > similar for a CREATE DATABASE from a large template > Nice improvement, of course. How does that affect performance on the cloned database? If I understand this correctly, it essentially enables CoW on the files, so what's the overhead on that? It'd be unfortunate to speed up CREATE DATABASE only to get degraded performance later. In any case, I find this interesting mainly for pg_upgrade use case. On running systems I think the main issue with CREATE DATABASE is that it forces a checkpoint. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2/21/18 18:57, Tomas Vondra wrote: > Nice improvement, of course. How does that affect performance on the > cloned database? If I understand this correctly, it essentially enables > CoW on the files, so what's the overhead on that? It'd be unfortunate to > speed up CREATE DATABASE only to get degraded performance later. I ran a little test (on APFS and XFS): Create a large (unlogged) table, copy the database, then delete everything from the table in the copy. That should need to CoW all the blocks. It has about the same performance with cloning, possibly slightly faster. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Feb 20, 2018 at 10:00:04PM -0500, Peter Eisentraut wrote: > Some new things have happened since then: > > - XFS has (optional) reflink support. This file system is probably more > widely used than Btrfs. Btrfs is still in development, there are I think no many people who would use it in production. > - Linux and glibc have a proper function to do this now. > > - APFS on macOS supports file cloning. So copyfile() is only part of macos? I am not able to find references in FreeBSD, NetBSD or OpenBSD, but I may be missing something. > So altogether this feature will be more widely usable and less ugly to > implement. Note, however, that you will currently need literally the > latest glibc release, so it probably won't be accessible right now > unless you are using Fedora 28 for example. (This is the > copy_file_range() function that had us recently rename the same function > in pg_rewind.) For reference, Debian SID is using glibc 2.27. ArchLinux is still on 2.26. > Some example measurements: > > 6 GB database, pg_upgrade unpatched 30 seconds, patched 3 seconds (XFS > and APFS) Interesting. I'll try to test that on an XFS partition and see if I can see a difference. For now I have just read through the patch. +#ifdef HAVE_COPYFILE + if (copyfile(fromfile, tofile, NULL, +#ifdef COPYFILE_CLONE + COPYFILE_CLONE +#else + COPYFILE_DATA +#endif + ) < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not copy file \"%s\" to \"%s\": %m", fromfile, tofile))); +#else copy_file(fromfile, tofile); +#endif Any backend-side callers of copy_file() would not benefit from copyfile() on OSX. Shouldn't all that handling be inside copy_file(), similarly to what your patch actually does for pg_upgrade? I think that you should also consider fcopyfile() instead of copyfile() as it works directly on the file descriptors and share the same error handling as the others. -- Michael
Attachment
On Mon, Mar 19, 2018 at 04:06:36PM +0900, Michael Paquier wrote: > Any backend-side callers of copy_file() would not benefit from > copyfile() on OSX. Shouldn't all that handling be inside copy_file(), > similarly to what your patch actually does for pg_upgrade? I think that > you should also consider fcopyfile() instead of copyfile() as it works > directly on the file descriptors and share the same error handling as > the others. Two other things I have noticed as well: 1) src/bin/pg_rewind/copy_fetch.c could benefit from similar speed-ups I think when copying data from source to target using the local mode of pg_rewind. This could really improve cases where new relations are added after a promotion. 2) XLogFileCopy() uses a copy logic as well. For large segments things could be improved, however we need to be careful about filling in the end of segments with zeros. -- Michael
Attachment
On Mon, Mar 19, 2018 at 04:14:15PM +0900, Michael Paquier wrote: > Two other things I have noticed as well: > 1) src/bin/pg_rewind/copy_fetch.c could benefit from similar speed-ups I > think when copying data from source to target using the local mode of > pg_rewind. This could really improve cases where new relations are > added after a promotion. > 2) XLogFileCopy() uses a copy logic as well. For large segments things > could be improved, however we need to be careful about filling in the > end of segments with zeros. I have been thinking about this patch over the night, and here is a list of bullet points which would be nice to tackle: - Remove the current diff in copydir. - Extend copy_file so as it is able to use fcopyfile. - Move the work done in pg_upgrade into a common API which can as well be used by pg_rewind as well. One place would be to have a frontend-only API in src/common which does the leg work. I would recommend working only on file descriptors as well for consistency with copy_file_range. - Add proper wait events for the backend calls. Those are missing for copy_file_range and copyfile. - For XLogFileCopy, the problem may be trickier as the tail of a segment is filled with zeroes, so dropping it from the first version of the patch sounds wiser. Patch is switched as waiting on author, I have set myself as a reviewer. Thanks, -- Michael
Attachment
On 3/19/18 22:58, Michael Paquier wrote: > I have been thinking about this patch over the night, and here is a list > of bullet points which would be nice to tackle: > - Remove the current diff in copydir. done > - Extend copy_file so as it is able to use fcopyfile. fcopyfile() does not support cloning. (This is not documented.) > - Move the work done in pg_upgrade into a common API which can as well > be used by pg_rewind as well. One place would be to have a > frontend-only API in src/common which does the leg work. I would > recommend working only on file descriptors as well for consistency with > copy_file_range. pg_upgrade copies files, whereas pg_rewind needs to copy file ranges. So I don't think this is going to be a good match. We could add support for using Linux copy_file_range() in pg_rewind, but that would work a bit differently. I also don't have a good sense of how to test the performance of that. Another thing to think about is that we go through some trouble to initialize new WAL files so that the disk space is fully allocated. If we used file cloning calls in pg_rewind, that would potentially invalidate some of that. At least, we'd have to think through this more carefully. > - Add proper wait events for the backend calls. Those are missing for > copy_file_range and copyfile. done > - For XLogFileCopy, the problem may be trickier as the tail of a segment > is filled with zeroes, so dropping it from the first version of the > patch sounds wiser. Seems like a possible follow-on project. But see also under pg_rewind above. Another oddity is that pg_upgrade uses CopyFile() on Windows, but the backend does not. The Git log shows that the backend used to use CopyFile(), but that was then removed when the generic code was added, but when pg_upgrade was imported, it came with the CopyFile() call. I suspect the CopyFile() call can be quite a bit faster, so we should consider adding it back in. Or if not, remove it from pg_upgrade. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Tue, Mar 20, 2018 at 10:55:04AM -0400, Peter Eisentraut wrote: > On 3/19/18 22:58, Michael Paquier wrote: >> - Extend copy_file so as it is able to use fcopyfile. > > fcopyfile() does not support cloning. (This is not documented.) You are right. I have been reading the documentation here to get an idea as I don't have a macos system at hand: https://www.unix.com/man-page/osx/3/fcopyfile/ However I have bumped into that: http://www.openradar.me/30706426 Future versions will be visibly fixed. >> - Move the work done in pg_upgrade into a common API which can as well >> be used by pg_rewind as well. One place would be to have a >> frontend-only API in src/common which does the leg work. I would >> recommend working only on file descriptors as well for consistency with >> copy_file_range. > > pg_upgrade copies files, whereas pg_rewind needs to copy file ranges. > So I don't think this is going to be a good match. > > We could add support for using Linux copy_file_range() in pg_rewind, but > that would work a bit differently. I also don't have a good sense of > how to test the performance of that. One simple way to test that would be to limit the time it takes to scan the WAL segments on the target so as the filemap is computed quickly, and create many, say gigabyte-size relations on the promoted source which will need to be copied from the source to the target. > Another thing to think about is that we go through some trouble to > initialize new WAL files so that the disk space is fully allocated. If > we used file cloning calls in pg_rewind, that would potentially > invalidate some of that. At least, we'd have to think through this more > carefully. Agreed. Let's keep in mind such things but come with a sane, first cut of this patch based on the time remaining in this commit fest. >> - Add proper wait events for the backend calls. Those are missing for >> copy_file_range and copyfile. > > done + <entry><literal>CopyFileCopy</literal></entry> + <entry>Waiting for a file copy operation (if the copying is done by + an operating system call rather than as separate read and write + operations).</entry> CopyFileCopy is... Redundant. Perhaps CopyFileSystem or CopyFileRange? >> - For XLogFileCopy, the problem may be trickier as the tail of a segment >> is filled with zeroes, so dropping it from the first version of the >> patch sounds wiser. > > Seems like a possible follow-on project. But see also under pg_rewind > above. No objections to do that in the future for both. > Another oddity is that pg_upgrade uses CopyFile() on Windows, but the > backend does not. The Git log shows that the backend used to use > CopyFile(), but that was then removed when the generic code was added, > but when pg_upgrade was imported, it came with the CopyFile() call. You mean 558730ac, right? > I suspect the CopyFile() call can be quite a bit faster, so we should > consider adding it back in. Or if not, remove it from pg_upgrade. Hm. The proposed patch also removes an important property of what happens now in copy_file: the copied files are periodically synced to avoid spamming the cache, so for some loads wouldn't this cause a performance regression? At least on Linux it is possible to rely on sync_file_range which is called via pg_flush_data, so it seems to me that we ought to roughly keep the loop working on FLUSH_DISTANCE, and replace the calls of read/write by copy_file_range. copyfile is only able to do a complete file copy, so we would also lose this property as well on Linux. Even for Windows using CopyFile would be a step backwards for the backend. pg_upgrade is different though as it copies files fully, so using both copyfile and copy_file_range makes sense. At the end, it seems to me that using copy_file_range has some values as you save a set of read/write calls, but copyfile comes with its limitations, which I think will cause side issues, so I would recommend dropping it from a first cut of the patch for the backend. -- Michael
Attachment
I think this documentation change: + leaving the old cluster untouched. At present, this is supported on Linux --------- would be better by changing "untouched" to "unmodified". Also, it would be nice if users could easily know if pg_upgrade is going to use COW or not because it might affect whether they choose --link or not. Right now it seems unclear how a user would know. Can we have pg_upgrade --check perhaps output something. Can we also have the pg_upgrade status display indicate that too, e.g. change Copying user relation files to Copying (copy-on-write) user relation files -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On 3/21/18 22:38, Michael Paquier wrote: > At least on Linux it is possible to rely on sync_file_range which is > called via pg_flush_data, so it seems to me that we ought to roughly > keep the loop working on FLUSH_DISTANCE, and replace the calls of > read/write by copy_file_range. copyfile is only able to do a complete > file copy, so we would also lose this property as well on Linux. I have shown earlier in the thread that copy_file_range in one go is still better than doing it in pieces. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 3/23/18 13:16, Bruce Momjian wrote: > Also, it would be nice if users could easily know if pg_upgrade is going > to use COW or not because it might affect whether they choose --link or > not. Right now it seems unclear how a user would know. Can we have > pg_upgrade --check perhaps output something. Can we also have the > pg_upgrade status display indicate that too, e.g. change > > Copying user relation files > > to > > Copying (copy-on-write) user relation files That would be nice, but we don't have a way to tell that, AFAICT. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Mar 25, 2018 at 09:33:38PM -0400, Peter Eisentraut wrote: > On 3/21/18 22:38, Michael Paquier wrote: >> At least on Linux it is possible to rely on sync_file_range which is >> called via pg_flush_data, so it seems to me that we ought to roughly >> keep the loop working on FLUSH_DISTANCE, and replace the calls of >> read/write by copy_file_range. copyfile is only able to do a complete >> file copy, so we would also lose this property as well on Linux. > > I have shown earlier in the thread that copy_file_range in one go is > still better than doing it in pieces. f8c183a has introduced the optimization that your patch is removing, which was discussed on this thread: https://www.postgresql.org/message-id/flat/4B78906A.7020309%40mark.mielke.cc I am not much into the internals of copy_file_range, but isn't there a risk to have a large range of blocks copied to discard potentially useful blocks from the OS cache? That's what this patch makes me worry about. Performance is good, but on a system where the OS cache is heavily used for a set of hot blocks this could cause performance side effects that I think we canot neglect. Another thing is that 71d6d07 allowed a couple of database commands to be more sensitive to interruptions. With large databases used as a base template it seems to me that this would cause the interruptions to be less responsive. -- Michael
Attachment
On 3/26/18 02:15, Michael Paquier wrote: > f8c183a has introduced the optimization that your patch is removing, > which was discussed on this thread: > https://www.postgresql.org/message-id/flat/4B78906A.7020309%40mark.mielke.cc Note that that thread is from 2010 and talks about creation of a database from the standard template being too slow on spinning rust, because we fsync too often. I think we have moved well past that problem size. I have run some more tests on both macOS and Linux with ext4, and my results are that the bigger the flush distance, the better. Before we made the adjustments for APFS, we had a flush size of 64kB, now it's 1MB and 32MB on macOS. In my tests, I see 256MB as the best across both platforms, and not flushing early at all is only minimally worse. You can measure this to death, and this obviously doesn't apply equally on all systems and configurations, but clearly some of the old assumptions from 8 years ago are no longer applicable. > I am not much into the internals of copy_file_range, but isn't there a > risk to have a large range of blocks copied to discard potentially > useful blocks from the OS cache? That's what this patch makes me worry > about. Performance is good, but on a system where the OS cache is > heavily used for a set of hot blocks this could cause performance side > effects that I think we canot neglect. How would we go about assessing that? It's possible, but if copy_file_range() really blows away all your in-use cache, that would be surprising. > Another thing is that 71d6d07 allowed a couple of database commands to > be more sensitive to interruptions. With large databases used as a base > template it seems to me that this would cause the interruptions to be > less responsive. The maximum file size that we copy is 1GB and that nowadays takes maybe 10 seconds. I think that would be an acceptable response time. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I think we have raised a number of interesting issues here which require more deeper consideration. So I suggest to set this patch to Returned with feedback. Btw., I just learned that copy_file_range() only works on files on the same device. So more arrangements will need to be made for that. > I have run some more tests on both macOS and Linux with ext4, and my> results are that the bigger the flush distance, thebetter. Before we> made the adjustments for APFS, we had a flush size of 64kB, now it's 1MB> and 32MB on macOS. In my tests, I see 256MB as the best across both> platforms, and not flushing early at all is only minimally worse. Based on this, I suggest that we set the flush distance to 32MB on all platforms. Not only is it faster, it avoids having different settings on some platforms. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I have made a major revision of this patch. I have removed all the changes to CREATE DATABASE. That was too contentious and we got lost in unrelated details there. The real benefit is for pg_upgrade. Another point was that for pg_upgrade use a user would like to know beforehand whether reflinking would be used, which was not possible with the copy_file_range() API. So here I have switched to using the ioctl() call directly. So the new interface is that pg_upgrade has a new option --reflink={always,auto,never}. (This option name is adapted from GNU cp.) From the documentation: <para> The setting <literal>always</literal> requires the use of relinks. If they are not supported, the <application>pg_upgrade</application> run will abort. Use this in production to limit the upgrade run time. The setting <literal>auto</literal> uses reflinks when available, otherwise it falls back to a normal copy. This is the default. The setting <literal>never</literal> prevents use of reflinks and always uses a normal copy. This can be useful to ensure that the upgraded cluster has its disk space fully allocated and not shared with the old cluster. </para> Also, pg_upgrade --check will check whether the selected option would work. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Wed, Jun 6, 2018 at 11:58 AM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > --reflink={always,auto,never}. (This option name is adapted from GNU ... > The setting <literal>always</literal> requires the use of relinks. If Is it supposed to be relinks or reflinks? The two lines above don't agree. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 6/8/18 14:06, Robert Haas wrote: > Is it supposed to be relinks or reflinks? The two lines above don't agree. It's supposed to be "reflinks". I'll fix that. I have also used the more general term "cloning" in the documentation. We can discuss which term we should use more. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Feb 21, 2018 at 4:00 PM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > - XFS has (optional) reflink support. This file system is probably more > widely used than Btrfs. > > - Linux and glibc have a proper function to do this now. > > - APFS on macOS supports file cloning. TIL that Solaris 11.4 (closed) ZFS supports reflink() too. Sadly, it's not in OpenZFS though I see numerous requests and discussions... (Of course you can just clone the whole filesystem and then pg_upgrade the clone in-place). -- Thomas Munro http://www.enterprisedb.com
On 13.07.18 07:09, Thomas Munro wrote: > On Wed, Feb 21, 2018 at 4:00 PM, Peter Eisentraut > <peter.eisentraut@2ndquadrant.com> wrote: >> - XFS has (optional) reflink support. This file system is probably more >> widely used than Btrfs. >> >> - Linux and glibc have a proper function to do this now. >> >> - APFS on macOS supports file cloning. > > TIL that Solaris 11.4 (closed) ZFS supports reflink() too. Sadly, > it's not in OpenZFS though I see numerous requests and discussions... I look forward to your FreeBSD patch then. ;-) -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jul 13, 2018 at 10:22:21AM +0200, Peter Eisentraut wrote: > On 13.07.18 07:09, Thomas Munro wrote: >> TIL that Solaris 11.4 (closed) ZFS supports reflink() too. Sadly, >> it's not in OpenZFS though I see numerous requests and discussions... > > I look forward to your FreeBSD patch then. ;-) +1. -- Michael
Attachment
On Wed, Jun 06, 2018 at 11:58:14AM -0400, Peter Eisentraut wrote: > <para> > The setting <literal>always</literal> requires the use of relinks. If > they are not supported, the <application>pg_upgrade</application> run > will abort. Use this in production to limit the upgrade run time. > The setting <literal>auto</literal> uses reflinks when available, > otherwise it falls back to a normal copy. This is the default. The > setting <literal>never</literal> prevents use of reflinks and always > uses a normal copy. This can be useful to ensure that the upgraded > cluster has its disk space fully allocated and not shared with the old > cluster. > </para> Hm... I am wondering if we actually want the "auto" mode where we make the option smarter automatically. I am afraid of users trying to use it and being surprised that there is no gain while they expected some. I would rather switch that to an on/off switch, which defaults to "off", or simply what is available now. huge_pages=try was a bad idea as the result is not deterministic, so I would not have more of that... Putting CloneFile and check_reflink in a location that other frontend binaries could use would be nice, like pg_rewind. -- Michael
Attachment
On 17.07.18 08:58, Michael Paquier wrote: > Hm... I am wondering if we actually want the "auto" mode where we make > the option smarter automatically. I am afraid of users trying to use it > and being surprised that there is no gain while they expected some. I > would rather switch that to an on/off switch, which defaults to "off", > or simply what is available now. huge_pages=try was a bad idea as the > result is not deterministic, so I would not have more of that... Why do you think that was a bad idea? Doing the best possible job by default without requiring explicit configuration in each case seems like an attractive feature. > Putting CloneFile and check_reflink in a location that other frontend > binaries could use would be nice, like pg_rewind. This could be done in subsequent patches, but the previous iteration of this patch for CREATE DATABASE integration already showed that each of those cases needs separate consideration. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
rebased patch, no functionality changes -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Hi Peter, On Sat, Sep 01, 2018 at 07:28:37AM +0200, Peter Eisentraut wrote: > rebased patch, no functionality changes Could you rebase once again? I am going through the patch and wanted to test pg_upgrade on Linux with XFS, but it does not apply anymore. Thanks, -- Michael
Attachment
On 26/09/2018 08:44, Michael Paquier wrote: > On Sat, Sep 01, 2018 at 07:28:37AM +0200, Peter Eisentraut wrote: >> rebased patch, no functionality changes > > Could you rebase once again? I am going through the patch and wanted to > test pg_upgrade on Linux with XFS, but it does not apply anymore. attached -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Thu, Sep 27, 2018 at 11:10:08PM +0200, Peter Eisentraut wrote: > On 26/09/2018 08:44, Michael Paquier wrote: >> Could you rebase once again? I am going through the patch and wanted to >> test pg_upgrade on Linux with XFS, but it does not apply anymore. > > attached Thanks for the rebase. At the end I got my hands on only an APFS using a mac. I ran a test with an instance holding a database with pgbench at scaling factor 500, which gives close to 6.5GB. The results are nice on my laptop: - --reflink=never runs in 15s - --reflink=always runs in 4s So that's a very nice gain! + static bool cloning_ok = true; + + pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", + old_file, new_file); + if (cloning_ok && + !cloneFile(old_file, new_file, map->nspname, map->relname, true)) + { + pg_log(PG_VERBOSE, "cloning not supported, switching to copying\n"); + cloning_ok = false; + copyFile(old_file, new_file, map->nspname, map->relname); + } + else + copyFile(old_file, new_file, map->nspname, map->relname); This part overlaps with the job that check_reflink() already does. Wouldn't it be more simple to have check_reflink do a one-time check with the automatic mode, enforcing to REFLINK_NEVER if cloning test did not pass when REFLINK_AUTO is used? This would simplify transfer_relfile(). The --help output of pg_upgrade has not been updated. I am not a fan of the --reflink interface to be honest, even if this maps to what cp offers, particularly because there is already the --link mode, and that the new option interacts with it. Here is an idea of interface with an option named, named say, --transfer-method: - "link", maps to the existing --link, which is just kept as a deprecated alias. - "clone" is the new mode you propose. - "copy" is the default, and copies directly files. This automatic mode also makes the implementation around transfer_relfile more difficult to apprehend in my opinion, and I would think that all the different transfer modes ought to be maintained within it. pg_upgrade.h also has logic for such transfer modes. After that, the implementation of cloneFile() looks logically correct to me. -- Michael
Attachment
On Fri, Sep 28, 2018 at 02:19:53PM +0900, Michael Paquier wrote: > Thanks for the rebase. At the end I got my hands on only an APFS using > a mac. I ran a test with an instance holding a database with pgbench at > scaling factor 500, which gives close to 6.5GB. The results are nice on > my laptop: > - --reflink=never runs in 15s > - --reflink=always runs in 4s > So that's a very nice gain! Please note that I have moved the patch to CF 2018-11, as the last review is very recent. -- Michael
Attachment
On 28/09/2018 07:19, Michael Paquier wrote: > + static bool cloning_ok = true; > + > + pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n", > + old_file, new_file); > + if (cloning_ok && > + !cloneFile(old_file, new_file, map->nspname, map->relname, true)) > + { > + pg_log(PG_VERBOSE, "cloning not supported, switching to copying\n"); > + cloning_ok = false; > + copyFile(old_file, new_file, map->nspname, map->relname); > + } > + else > + copyFile(old_file, new_file, map->nspname, map->relname); > > This part overlaps with the job that check_reflink() already does. > Wouldn't it be more simple to have check_reflink do a one-time check > with the automatic mode, enforcing to REFLINK_NEVER if cloning test did > not pass when REFLINK_AUTO is used? This would simplify > transfer_relfile(). I'll look into that. > The --help output of pg_upgrade has not been updated. will fix > I am not a fan of the --reflink interface to be honest, even if this > maps to what cp offers, particularly because there is already the --link > mode, and that the new option interacts with it. Here is an idea of > interface with an option named, named say, --transfer-method: > - "link", maps to the existing --link, which is just kept as a > deprecated alias. > - "clone" is the new mode you propose. > - "copy" is the default, and copies directly files. This automatic mode > also makes the implementation around transfer_relfile more difficult to > apprehend in my opinion, and I would think that all the different > transfer modes ought to be maintained within it. pg_upgrade.h also has > logic for such transfer modes. I can see the argument for that. But I don't understand where the automatic mode fits into this. I would like to keep all three modes from my patch: copy, clone-if-possible, clone-or-fail, unless you want to argue against that. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Oct 02, 2018 at 02:31:35PM +0200, Peter Eisentraut wrote: > I can see the argument for that. But I don't understand where the > automatic mode fits into this. I would like to keep all three modes > from my patch: copy, clone-if-possible, clone-or-fail, unless you want > to argue against that. I'd like to argue against that :) There could be an argument for having an automatic more within this scheme, still I am not really a fan of this. When somebody integrates pg_upgrade within an upgrade framework, they would likely test if cloning actually works, bumping immediately on a failure, no? I'd like to think that copy should be the default, cloning being available as an option. Cloning is not supported on many filesystems anyway.. -- Michael
Attachment
On 2018-Oct-03, Michael Paquier wrote: > There could be an argument for having an automatic more within this > scheme, still I am not really a fan of this. When somebody integrates > pg_upgrade within an upgrade framework, they would likely test if > cloning actually works, bumping immediately on a failure, no? I'd like > to think that copy should be the default, cloning being available as an > option. Cloning is not supported on many filesystems anyway.. I'm not clear what interface are you proposing. Maybe they would just use the clone-or-fail mode, and note whether it fails? If that's not it, please explain. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Oct 02, 2018 at 10:35:02PM -0300, Alvaro Herrera wrote: > I'm not clear what interface are you proposing. Maybe they would just > use the clone-or-fail mode, and note whether it fails? If that's not > it, please explain. Okay. What I am proposing is to not have any kind of automatic mode to keep the code simple, with a new option called --transfer-mode, able to do three things: - "link", which is the equivalent of the existing --link. - "copy", the default and the current behavior, which copies files. - "clone", the new mode proposed in this thread. -- Michael
Attachment
On 2018-Oct-03, Michael Paquier wrote: > On Tue, Oct 02, 2018 at 10:35:02PM -0300, Alvaro Herrera wrote: > > I'm not clear what interface are you proposing. Maybe they would just > > use the clone-or-fail mode, and note whether it fails? If that's not > > it, please explain. > > Okay. What I am proposing is to not have any kind of automatic mode to > keep the code simple, with a new option called --transfer-mode, able to > do three things: > - "link", which is the equivalent of the existing --link. > - "copy", the default and the current behavior, which copies files. > - "clone", the new mode proposed in this thread. I see. Peter is proposing to have a fourth mode, essentially --transfer-mode=clone-or-copy. Thinking about a generic tool that wraps pg_upgrade (say, Debian's wrapper for it) this makes sense: just use the fastest non-destructive mode available. Which ones are available depends on what the underlying filesystem is, so it's not up to the tool's writer to decide which to use ahead of time. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Oct 02, 2018 at 11:07:06PM -0300, Alvaro Herrera wrote: > I see. Peter is proposing to have a fourth mode, essentially > --transfer-mode=clone-or-copy. Yes, a mode which depends on what the file system supports. Perhaps "safe" or "fast" could be another name, in the shape of the fastest method available which does not destroy the existing cluster's data. > Thinking about a generic tool that wraps pg_upgrade (say, Debian's > wrapper for it) this makes sense: just use the fastest non-destructive > mode available. Which ones are available depends on what the underlying > filesystem is, so it's not up to the tool's writer to decide which to > use ahead of time. This could have merit. Now, it seems to me that we have two separate concepts here, which should be addressed separately: 1) cloning file mode, which is the new actual feature. 2) automatic mode, which is a subset of the copy mode and the new clone mode. What I am not sure when it comes to that stuff is if we should consider in some way the link mode as being part of this automatic selection concept.. -- Michael
Attachment
On Tue, Oct 2, 2018 at 11:07:06PM -0300, Alvaro Herrera wrote: > On 2018-Oct-03, Michael Paquier wrote: > > > On Tue, Oct 02, 2018 at 10:35:02PM -0300, Alvaro Herrera wrote: > > > I'm not clear what interface are you proposing. Maybe they would just > > > use the clone-or-fail mode, and note whether it fails? If that's not > > > it, please explain. > > > > Okay. What I am proposing is to not have any kind of automatic mode to > > keep the code simple, with a new option called --transfer-mode, able to > > do three things: > > - "link", which is the equivalent of the existing --link. > > - "copy", the default and the current behavior, which copies files. > > - "clone", the new mode proposed in this thread. > > I see. Peter is proposing to have a fourth mode, essentially > --transfer-mode=clone-or-copy. Uh, if you use --link, and the two data directories are on different file systems, it fails. No one has ever asked for link-or-copy, so why are we considering clone-or-copy? Are we going to need link-or-clone-or-copy too? I do realize that clone and copy have non-destructive behavior on the old cluster once started, so it does make some sense to merge them, unlike link. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On 10/10/2018 21:50, Bruce Momjian wrote: >> I see. Peter is proposing to have a fourth mode, essentially >> --transfer-mode=clone-or-copy. > > Uh, if you use --link, and the two data directories are on different > file systems, it fails. No one has ever asked for link-or-copy, so why > are we considering clone-or-copy? Are we going to need > link-or-clone-or-copy too? I do realize that clone and copy have > non-destructive behavior on the old cluster once started, so it does > make some sense to merge them, unlike link. I'm OK to get rid of the clone-or-copy mode. I can see the confusion. The original reason for this behavior was that the Linux implementation used copy_file_range(), which does clone-or-copy without any way to control it. But the latest patch doesn't use that anymore, so we don't really need it, if it's controversial. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 11/10/2018 16:50, Peter Eisentraut wrote: > On 10/10/2018 21:50, Bruce Momjian wrote: >>> I see. Peter is proposing to have a fourth mode, essentially >>> --transfer-mode=clone-or-copy. >> >> Uh, if you use --link, and the two data directories are on different >> file systems, it fails. No one has ever asked for link-or-copy, so why >> are we considering clone-or-copy? Are we going to need >> link-or-clone-or-copy too? I do realize that clone and copy have >> non-destructive behavior on the old cluster once started, so it does >> make some sense to merge them, unlike link. > > I'm OK to get rid of the clone-or-copy mode. I can see the confusion. New patch that removes all the various reflink modes and adds a new option --clone that works similar to --link. I think it's much cleaner now. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Thu, Oct 18, 2018 at 11:59:00PM +0200, Peter Eisentraut wrote: > New patch that removes all the various reflink modes and adds a new > option --clone that works similar to --link. I think it's much cleaner now. Thanks Peter for the new version. + + {"clone", no_argument, NULL, 1}, + {NULL, 0, NULL, 0} Why newlines here? @@ -293,6 +300,7 @@ usage(void) printf(_(" -U, --username=NAME cluster superuser (default \"%s\")\n"), os_info.user); printf(_(" -v, --verbose enable verbose internal logging\n")); printf(_(" -V, --version display version information, then exit\n")); + printf(_(" --clone clone instead of copying files to new cluster\n")); printf(_(" -?, --help show this help, then exit\n")); printf(_("\n" An idea for a one-letter option could be -n. This patch can live without. + pg_fatal("error while cloning relation \"%s.%s\": could not open file \"%s\": %s\n", + schemaName, relName, src, strerror(errno)); The tail of those error messages "could not open file" and "could not create file" are already available as translatable error messages. Would it be better to split those messages in two for translators? One is generated with pg_fatal("error while cloning relation \"%s.%s\": could not open file \"%s\": %s\n", + schemaName, relName, src, strerror(errno)); The tail of those error messages "could not open file" and "could not create file" are already available as translatable error messages. Would it be better to split those messages in two for translators? One is generated with PG_VERBOSE to mention that cloning checks are done, and the second one is fatal with the actual errors. Those are all minor points, the patch looks good to me. -- Michael
Attachment
On 19/10/2018 01:50, Michael Paquier wrote: > On Thu, Oct 18, 2018 at 11:59:00PM +0200, Peter Eisentraut wrote: >> New patch that removes all the various reflink modes and adds a new >> option --clone that works similar to --link. I think it's much cleaner now. > > Thanks Peter for the new version. > > + > + {"clone", no_argument, NULL, 1}, > + > {NULL, 0, NULL, 0} > > Why newlines here? fixed > @@ -293,6 +300,7 @@ usage(void) > printf(_(" -U, --username=NAME cluster superuser (default \"%s\")\n"), os_info.user); > printf(_(" -v, --verbose enable verbose internal logging\n")); > printf(_(" -V, --version display version information, then exit\n")); > + printf(_(" --clone clone instead of copying files to new cluster\n")); > printf(_(" -?, --help show this help, then exit\n")); > printf(_("\n" > > An idea for a one-letter option could be -n. This patch can live > without. -n is often used for something like "dry run", so it didn't go for that here. I suspect the cloning will remain a marginal option for some time, so having only a long option is acceptable. > + pg_fatal("error while cloning relation \"%s.%s\": could not open file \"%s\": %s\n", > + schemaName, relName, src, strerror(errno)); > > The tail of those error messages "could not open file" and "could not > create file" are already available as translatable error messages. > Would it be better to split those messages in two for translators? One > is generated with pg_fatal("error while cloning relation \"%s.%s\": > could not open file \"%s\": %s\n", > + schemaName, relName, src, strerror(errno)); I think this is too complicated for a few messages. > Those are all minor points, the patch looks good to me. Committed, thanks. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes: > Committed, thanks. It seems unfortunate that there is no test coverage in the committed patch. I realize that it may be hard to test given the filesystem dependency, but how will we know if it works at all? regards, tom lane
On Wed, Nov 07, 2018 at 06:42:00PM +0100, Peter Eisentraut wrote: > -n is often used for something like "dry run", so it didn't go for that > here. I suspect the cloning will remain a marginal option for some > time, so having only a long option is acceptable. Good point. I didn't consider this, and it is true that --dry-run is used, pg_rewind being one example. -- Michael
Attachment
On 07/11/2018 19:03, Tom Lane wrote: > Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes: >> Committed, thanks. > > It seems unfortunate that there is no test coverage in the committed > patch. I realize that it may be hard to test given the filesystem > dependency, but how will we know if it works at all? You can use something like PG_UPGRADE_OPTS=--clone make -C src/bin/pg_upgrade check (--link is similarly untested.) If we do get the TAP tests for pg_upgrade set up, we can create smaller and faster test cases for this. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services