Thread: file cloning in pg_upgrade and CREATE DATABASE

file cloning in pg_upgrade and CREATE DATABASE

From

Peter Eisentraut

Date:

21 February 2018, 06:00:04

Here is another attempt at implementing file cloning for pg_upgrade and
CREATE DATABASE.  The idea is to take advantage of file systems that can
make copy-on-write clones, which would make the copy run much faster.
For pg_upgrade, this will give the performance of --link mode without
the associated drawbacks.

There have been patches proposed previously [0][1].  The concerns there
were mainly that they required a Linux-specific ioctl() call and only
worked for Btrfs.

Some new things have happened since then:

- XFS has (optional) reflink support.  This file system is probably more
widely used than Btrfs.

- Linux and glibc have a proper function to do this now.

- APFS on macOS supports file cloning.

So altogether this feature will be more widely usable and less ugly to
implement.  Note, however, that you will currently need literally the
latest glibc release, so it probably won't be accessible right now
unless you are using Fedora 28 for example.  (This is the
copy_file_range() function that had us recently rename the same function
in pg_rewind.)

Some example measurements:

6 GB database, pg_upgrade unpatched 30 seconds, patched 3 seconds (XFS
and APFS)

similar for a CREATE DATABASE from a large template

Even if you don't have a file system with cloning support, the special
library calls make copying faster.  For example, on APFS, in this
example, an unpatched CREATE DATABASE takes 30 seconds, with the library
call (but without cloning) it takes 10 seconds.

For amusement/bewilderment, without the recent flush optimization on
APFS, this takes 2 minutes 30 seconds.  I suppose this optimization will
now actually obsolete, since macOS will no longer hit that code.


[0]:
https://www.postgresql.org/message-id/flat/513C0E7C.5080606%40socialserve.com

[1]:
https://www.postgresql.org/message-id/flat/20140213030731.GE4831%40momjian.us
-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

0001-Use-file-cloning-in-pg_upgrade-and-CREATE-DATABASE.patch

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Robert Haas

Date:

21 February 2018, 19:33:18

On Tue, Feb 20, 2018 at 10:00 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> Some example measurements:
>
> 6 GB database, pg_upgrade unpatched 30 seconds, patched 3 seconds (XFS
> and APFS)
>
> similar for a CREATE DATABASE from a large template
>
> Even if you don't have a file system with cloning support, the special
> library calls make copying faster.  For example, on APFS, in this
> example, an unpatched CREATE DATABASE takes 30 seconds, with the library
> call (but without cloning) it takes 10 seconds.

Nice results.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Tomas Vondra

Date:

22 February 2018, 02:57:19

On 02/21/2018 04:00 AM, Peter Eisentraut wrote:
> ...
> 
> Some example measurements:
> 
> 6 GB database, pg_upgrade unpatched 30 seconds, patched 3 seconds (XFS
> and APFS)
> 
> similar for a CREATE DATABASE from a large template
> 

Nice improvement, of course. How does that affect performance on the
cloned database? If I understand this correctly, it essentially enables
CoW on the files, so what's the overhead on that? It'd be unfortunate to
speed up CREATE DATABASE only to get degraded performance later.

In any case, I find this interesting mainly for pg_upgrade use case. On
running systems I think the main issue with CREATE DATABASE is that it
forces a checkpoint.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Peter Eisentraut

Date:

23 February 2018, 01:49:48

On 2/21/18 18:57, Tomas Vondra wrote:
> Nice improvement, of course. How does that affect performance on the
> cloned database? If I understand this correctly, it essentially enables
> CoW on the files, so what's the overhead on that? It'd be unfortunate to
> speed up CREATE DATABASE only to get degraded performance later.

I ran a little test (on APFS and XFS): Create a large (unlogged) table,
copy the database, then delete everything from the table in the copy.
That should need to CoW all the blocks.  It has about the same
performance with cloning, possibly slightly faster.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Michael Paquier

Date:

19 March 2018, 10:06:36

On Tue, Feb 20, 2018 at 10:00:04PM -0500, Peter Eisentraut wrote:
> Some new things have happened since then:
>
> - XFS has (optional) reflink support.  This file system is probably more
> widely used than Btrfs.

Btrfs is still in development, there are I think no many people who
would use it in production.

> - Linux and glibc have a proper function to do this now.
>
> - APFS on macOS supports file cloning.

So copyfile() is only part of macos?  I am not able to find references
in FreeBSD, NetBSD or OpenBSD, but I may be missing something.

> So altogether this feature will be more widely usable and less ugly to
> implement.  Note, however, that you will currently need literally the
> latest glibc release, so it probably won't be accessible right now
> unless you are using Fedora 28 for example.  (This is the
> copy_file_range() function that had us recently rename the same function
> in pg_rewind.)

For reference, Debian SID is using glibc 2.27.  ArchLinux is still on
2.26.

> Some example measurements:
>
> 6 GB database, pg_upgrade unpatched 30 seconds, patched 3 seconds (XFS
> and APFS)

Interesting.  I'll try to test that on an XFS partition and see if I can
see a difference.  For now I have just read through the patch.

+#ifdef HAVE_COPYFILE
+    if (copyfile(fromfile, tofile, NULL,
+#ifdef COPYFILE_CLONE
+             COPYFILE_CLONE
+#else
+                    COPYFILE_DATA
+#endif
+                       ) < 0)
+        ereport(ERROR,
+            (errcode_for_file_access(),
+             errmsg("could not copy file \"%s\" to \"%s\": %m", fromfile, tofile)));
+#else
        copy_file(fromfile, tofile);
+#endif

Any backend-side callers of copy_file() would not benefit from
copyfile() on OSX.  Shouldn't all that handling be inside copy_file(),
similarly to what your patch actually does for pg_upgrade?  I think that
you should also consider fcopyfile() instead of copyfile() as it works
directly on the file descriptors and share the same error handling as
the others.
--
Michael

Attachment

signature.asc

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Michael Paquier

Date:

19 March 2018, 10:14:15

On Mon, Mar 19, 2018 at 04:06:36PM +0900, Michael Paquier wrote:
> Any backend-side callers of copy_file() would not benefit from
> copyfile() on OSX.  Shouldn't all that handling be inside copy_file(),
> similarly to what your patch actually does for pg_upgrade?  I think that
> you should also consider fcopyfile() instead of copyfile() as it works
> directly on the file descriptors and share the same error handling as
> the others.

Two other things I have noticed as well:
1) src/bin/pg_rewind/copy_fetch.c could benefit from similar speed-ups I
think when copying data from source to target using the local mode of
pg_rewind.  This could really improve cases where new relations are
added after a promotion.
2) XLogFileCopy() uses a copy logic as well.  For large segments things
could be improved, however we need to be careful about filling in the
end of segments with zeros.
--
Michael

Attachment

signature.asc

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Michael Paquier

Date:

20 March 2018, 05:58:49

On Mon, Mar 19, 2018 at 04:14:15PM +0900, Michael Paquier wrote:
> Two other things I have noticed as well:
> 1) src/bin/pg_rewind/copy_fetch.c could benefit from similar speed-ups I
> think when copying data from source to target using the local mode of
> pg_rewind.  This could really improve cases where new relations are
> added after a promotion.
> 2) XLogFileCopy() uses a copy logic as well.  For large segments things
> could be improved, however we need to be careful about filling in the
> end of segments with zeros.

I have been thinking about this patch over the night, and here is a list
of bullet points which would be nice to tackle:
- Remove the current diff in copydir.
- Extend copy_file so as it is able to use fcopyfile.
- Move the work done in pg_upgrade into a common API which can as well
be used by pg_rewind as well.  One place would be to have a
frontend-only API in src/common which does the leg work.  I would
recommend working only on file descriptors as well for consistency with
copy_file_range.
- Add proper wait events for the backend calls.  Those are missing for
copy_file_range and copyfile.
- For XLogFileCopy, the problem may be trickier as the tail of a segment
is filled with zeroes, so dropping it from the first version of the
patch sounds wiser.

Patch is switched as waiting on author, I have set myself as a
reviewer.

Thanks,
--
Michael

Attachment

signature.asc

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Peter Eisentraut

Date:

20 March 2018, 17:55:04

On 3/19/18 22:58, Michael Paquier wrote:
> I have been thinking about this patch over the night, and here is a list
> of bullet points which would be nice to tackle:
> - Remove the current diff in copydir.

done

> - Extend copy_file so as it is able to use fcopyfile.

fcopyfile() does not support cloning.  (This is not documented.)

> - Move the work done in pg_upgrade into a common API which can as well
> be used by pg_rewind as well.  One place would be to have a
> frontend-only API in src/common which does the leg work.  I would
> recommend working only on file descriptors as well for consistency with
> copy_file_range.

pg_upgrade copies files, whereas pg_rewind needs to copy file ranges.
So I don't think this is going to be a good match.

We could add support for using Linux copy_file_range() in pg_rewind, but
that would work a bit differently.  I also don't have a good sense of
how to test the performance of that.

Another thing to think about is that we go through some trouble to
initialize new WAL files so that the disk space is fully allocated.  If
we used file cloning calls in pg_rewind, that would potentially
invalidate some of that.  At least, we'd have to think through this more
carefully.

> - Add proper wait events for the backend calls.  Those are missing for
> copy_file_range and copyfile.

done

> - For XLogFileCopy, the problem may be trickier as the tail of a segment
> is filled with zeroes, so dropping it from the first version of the
> patch sounds wiser.

Seems like a possible follow-on project.  But see also under pg_rewind
above.

Another oddity is that pg_upgrade uses CopyFile() on Windows, but the
backend does not.  The Git log shows that the backend used to use
CopyFile(), but that was then removed when the generic code was added,
but when pg_upgrade was imported, it came with the CopyFile() call.

I suspect the CopyFile() call can be quite a bit faster, so we should
consider adding it back in.  Or if not, remove it from pg_upgrade.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

v2-0001-Use-file-cloning-in-pg_upgrade-and-CREATE-DATABAS.patch

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Michael Paquier

Date:

22 March 2018, 05:38:31

On Tue, Mar 20, 2018 at 10:55:04AM -0400, Peter Eisentraut wrote:
> On 3/19/18 22:58, Michael Paquier wrote:
>> - Extend copy_file so as it is able to use fcopyfile.
>
> fcopyfile() does not support cloning.  (This is not documented.)

You are right.  I have been reading the documentation here to get an
idea as I don't have a macos system at hand:
https://www.unix.com/man-page/osx/3/fcopyfile/

However I have bumped into that:
http://www.openradar.me/30706426

Future versions will be visibly fixed.

>> - Move the work done in pg_upgrade into a common API which can as well
>> be used by pg_rewind as well.  One place would be to have a
>> frontend-only API in src/common which does the leg work.  I would
>> recommend working only on file descriptors as well for consistency with
>> copy_file_range.
>
> pg_upgrade copies files, whereas pg_rewind needs to copy file ranges.
> So I don't think this is going to be a good match.
>
> We could add support for using Linux copy_file_range() in pg_rewind, but
> that would work a bit differently.  I also don't have a good sense of
> how to test the performance of that.

One simple way to test that would be to limit the time it takes to scan
the WAL segments on the target so as the filemap is computed quickly,
and create many, say gigabyte-size relations on the promoted source
which will need to be copied from the source to the target.

> Another thing to think about is that we go through some trouble to
> initialize new WAL files so that the disk space is fully allocated.  If
> we used file cloning calls in pg_rewind, that would potentially
> invalidate some of that.  At least, we'd have to think through this more
> carefully.

Agreed.  Let's keep in mind such things but come with a sane, first cut
of this patch based on the time remaining in this commit fest.

>> - Add proper wait events for the backend calls.  Those are missing for
>> copy_file_range and copyfile.
>
> done

+         <entry><literal>CopyFileCopy</literal></entry>
+         <entry>Waiting for a file copy operation (if the copying is done by
+         an operating system call rather than as separate read and write
+         operations).</entry>
CopyFileCopy is... Redundant.  Perhaps CopyFileSystem or CopyFileRange?

>> - For XLogFileCopy, the problem may be trickier as the tail of a segment
>> is filled with zeroes, so dropping it from the first version of the
>> patch sounds wiser.
>
> Seems like a possible follow-on project.  But see also under pg_rewind
> above.

No objections to do that in the future for both.

> Another oddity is that pg_upgrade uses CopyFile() on Windows, but the
> backend does not.  The Git log shows that the backend used to use
> CopyFile(), but that was then removed when the generic code was added,
> but when pg_upgrade was imported, it came with the CopyFile() call.

You mean 558730ac, right?

> I suspect the CopyFile() call can be quite a bit faster, so we should
> consider adding it back in.  Or if not, remove it from pg_upgrade.

Hm.  The proposed patch also removes an important property of what
happens now in copy_file: the copied files are periodically synced to
avoid spamming the cache, so for some loads wouldn't this cause a
performance regression?

At least on Linux it is possible to rely on sync_file_range which is
called via pg_flush_data, so it seems to me that we ought to roughly
keep the loop working on FLUSH_DISTANCE, and replace the calls of
read/write by copy_file_range.  copyfile is only able to do a complete
file copy, so we would also lose this property as well on Linux.  Even
for Windows using CopyFile would be a step backwards for the backend.

pg_upgrade is different though as it copies files fully, so using both
copyfile and copy_file_range makes sense.

At the end, it seems to me that using copy_file_range has some values as
you save a set of read/write calls, but copyfile comes with its
limitations, which I think will cause side issues, so I would recommend
dropping it from a first cut of the patch for the backend.
--
Michael

Attachment

signature.asc

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Bruce Momjian

Date:

23 March 2018, 20:16:50

I think this documentation change:

    +   leaving the old cluster untouched.  At present, this is supported on Linux
                                ---------

would be better by changing "untouched" to "unmodified".

Also, it would be nice if users could easily know if pg_upgrade is going
to use COW or not because it might affect whether they choose --link or
not.  Right now it seems unclear how a user would know.  Can we have
pg_upgrade --check perhaps output something.  Can we also have the
pg_upgrade status display indicate that too, e.g. change

    Copying user relation files

to

    Copying (copy-on-write) user relation files

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Peter Eisentraut

Date:

26 March 2018, 04:33:38

On 3/21/18 22:38, Michael Paquier wrote:
> At least on Linux it is possible to rely on sync_file_range which is
> called via pg_flush_data, so it seems to me that we ought to roughly
> keep the loop working on FLUSH_DISTANCE, and replace the calls of
> read/write by copy_file_range.  copyfile is only able to do a complete
> file copy, so we would also lose this property as well on Linux.

I have shown earlier in the thread that copy_file_range in one go is
still better than doing it in pieces.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Peter Eisentraut

Date:

26 March 2018, 04:34:14

On 3/23/18 13:16, Bruce Momjian wrote:
> Also, it would be nice if users could easily know if pg_upgrade is going
> to use COW or not because it might affect whether they choose --link or
> not.  Right now it seems unclear how a user would know.  Can we have
> pg_upgrade --check perhaps output something.  Can we also have the
> pg_upgrade status display indicate that too, e.g. change
> 
>     Copying user relation files
> 
> to
> 
>     Copying (copy-on-write) user relation files

That would be nice, but we don't have a way to tell that, AFAICT.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Michael Paquier

Date:

26 March 2018, 09:15:16

On Sun, Mar 25, 2018 at 09:33:38PM -0400, Peter Eisentraut wrote:
> On 3/21/18 22:38, Michael Paquier wrote:
>> At least on Linux it is possible to rely on sync_file_range which is
>> called via pg_flush_data, so it seems to me that we ought to roughly
>> keep the loop working on FLUSH_DISTANCE, and replace the calls of
>> read/write by copy_file_range.  copyfile is only able to do a complete
>> file copy, so we would also lose this property as well on Linux.
>
> I have shown earlier in the thread that copy_file_range in one go is
> still better than doing it in pieces.

f8c183a has introduced the optimization that your patch is removing,
which was discussed on this thread:
https://www.postgresql.org/message-id/flat/4B78906A.7020309%40mark.mielke.cc
I am not much into the internals of copy_file_range, but isn't there a
risk to have a large range of blocks copied to discard potentially
useful blocks from the OS cache?  That's what this patch makes me worry
about.  Performance is good, but on a system where the OS cache is
heavily used for a set of hot blocks this could cause performance side
effects that I think we canot neglect.

Another thing is that 71d6d07 allowed a couple of database commands to
be more sensitive to interruptions.  With large databases used as a base
template it seems to me that this would cause the interruptions to be
less responsive.
--
Michael

Attachment

signature.asc

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Peter Eisentraut

Date:

30 March 2018, 23:15:39

On 3/26/18 02:15, Michael Paquier wrote:
> f8c183a has introduced the optimization that your patch is removing,
> which was discussed on this thread:
> https://www.postgresql.org/message-id/flat/4B78906A.7020309%40mark.mielke.cc

Note that that thread is from 2010 and talks about creation of a
database from the standard template being too slow on spinning rust,
because we fsync too often.  I think we have moved well past that
problem size.

I have run some more tests on both macOS and Linux with ext4, and my
results are that the bigger the flush distance, the better.  Before we
made the adjustments for APFS, we had a flush size of 64kB, now it's 1MB
and 32MB on macOS.  In my tests, I see 256MB as the best across both
platforms, and not flushing early at all is only minimally worse.

You can measure this to death, and this obviously doesn't apply equally
on all systems and configurations, but clearly some of the old
assumptions from 8 years ago are no longer applicable.

> I am not much into the internals of copy_file_range, but isn't there a
> risk to have a large range of blocks copied to discard potentially
> useful blocks from the OS cache?  That's what this patch makes me worry
> about.  Performance is good, but on a system where the OS cache is
> heavily used for a set of hot blocks this could cause performance side
> effects that I think we canot neglect.

How would we go about assessing that?  It's possible, but if
copy_file_range() really blows away all your in-use cache, that would be
surprising.

> Another thing is that 71d6d07 allowed a couple of database commands to
> be more sensitive to interruptions.  With large databases used as a base
> template it seems to me that this would cause the interruptions to be
> less responsive.

The maximum file size that we copy is 1GB and that nowadays takes maybe
10 seconds.  I think that would be an acceptable response time.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Peter Eisentraut

Date:

05 April 2018, 23:55:58

I think we have raised a number of interesting issues here which require
more deeper consideration.  So I suggest to set this patch to Returned
with feedback.

Btw., I just learned that copy_file_range() only works on files on the
same device.  So more arrangements will need to be made for that.

> I have run some more tests on both macOS and Linux with ext4, and my> results are that the bigger the flush distance,
thebetter.  Before
 
we> made the adjustments for APFS, we had a flush size of 64kB, now it's
1MB> and 32MB on macOS.  In my tests, I see 256MB as the best across
both> platforms, and not flushing early at all is only minimally worse.
Based on this, I suggest that we set the flush distance to 32MB on all
platforms.  Not only is it faster, it avoids having different settings
on some platforms.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Peter Eisentraut

Date:

06 June 2018, 18:58:14

I have made a major revision of this patch.

I have removed all the changes to CREATE DATABASE.  That was too
contentious and we got lost in unrelated details there.  The real
benefit is for pg_upgrade.

Another point was that for pg_upgrade use a user would like to know
beforehand whether reflinking would be used, which was not possible with
the copy_file_range() API.  So here I have switched to using the ioctl()
call directly.

So the new interface is that pg_upgrade has a new option
--reflink={always,auto,never}.  (This option name is adapted from GNU
cp.)  From the documentation:

<para>
 The setting <literal>always</literal> requires the use of relinks.  If
 they are not supported, the <application>pg_upgrade</application> run
 will abort.  Use this in production to limit the upgrade run time.
 The setting <literal>auto</literal> uses reflinks when available,
 otherwise it falls back to a normal copy.  This is the default.  The
 setting <literal>never</literal> prevents use of reflinks and always
 uses a normal copy.  This can be useful to ensure that the upgraded
 cluster has its disk space fully allocated and not shared with the old
 cluster.
</para>

Also, pg_upgrade --check will check whether the selected option would work.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

v3-0001-pg_upgrade-Allow-use-of-file-cloning.patch

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Robert Haas

Date:

08 June 2018, 21:06:46

On Wed, Jun 6, 2018 at 11:58 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> --reflink={always,auto,never}.  (This option name is adapted from GNU
...
>  The setting <literal>always</literal> requires the use of relinks.  If

Is it supposed to be relinks or reflinks?  The two lines above don't agree.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Peter Eisentraut

Date:

11 June 2018, 18:31:23

On 6/8/18 14:06, Robert Haas wrote:
> Is it supposed to be relinks or reflinks?  The two lines above don't agree.

It's supposed to be "reflinks".  I'll fix that.

I have also used the more general term "cloning" in the documentation.
We can discuss which term we should use more.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Thomas Munro

Date:

13 July 2018, 08:09:29

On Wed, Feb 21, 2018 at 4:00 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> - XFS has (optional) reflink support.  This file system is probably more
> widely used than Btrfs.
>
> - Linux and glibc have a proper function to do this now.
>
> - APFS on macOS supports file cloning.

TIL that Solaris 11.4 (closed) ZFS supports reflink() too.  Sadly,
it's not in OpenZFS though I see numerous requests and discussions...
(Of course you can just clone the whole filesystem and then pg_upgrade
the clone in-place).

-- 
Thomas Munro
http://www.enterprisedb.com

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Peter Eisentraut

Date:

13 July 2018, 11:22:21

On 13.07.18 07:09, Thomas Munro wrote:
> On Wed, Feb 21, 2018 at 4:00 PM, Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
>> - XFS has (optional) reflink support.  This file system is probably more
>> widely used than Btrfs.
>>
>> - Linux and glibc have a proper function to do this now.
>>
>> - APFS on macOS supports file cloning.
> 
> TIL that Solaris 11.4 (closed) ZFS supports reflink() too.  Sadly,
> it's not in OpenZFS though I see numerous requests and discussions...

I look forward to your FreeBSD patch then. ;-)

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Michael Paquier

Date:

13 July 2018, 16:40:42

On Fri, Jul 13, 2018 at 10:22:21AM +0200, Peter Eisentraut wrote:
> On 13.07.18 07:09, Thomas Munro wrote:
>> TIL that Solaris 11.4 (closed) ZFS supports reflink() too.  Sadly,
>> it's not in OpenZFS though I see numerous requests and discussions...
>
> I look forward to your FreeBSD patch then. ;-)

+1.
--
Michael

Attachment

signature.asc

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Michael Paquier

Date:

17 July 2018, 09:58:34

On Wed, Jun 06, 2018 at 11:58:14AM -0400, Peter Eisentraut wrote:
> <para>
>  The setting <literal>always</literal> requires the use of relinks.  If
>  they are not supported, the <application>pg_upgrade</application> run
>  will abort.  Use this in production to limit the upgrade run time.
>  The setting <literal>auto</literal> uses reflinks when available,
>  otherwise it falls back to a normal copy.  This is the default.  The
>  setting <literal>never</literal> prevents use of reflinks and always
>  uses a normal copy.  This can be useful to ensure that the upgraded
>  cluster has its disk space fully allocated and not shared with the old
>  cluster.
> </para>

Hm...  I am wondering if we actually want the "auto" mode where we make
the option smarter automatically.  I am afraid of users trying to use it
and being surprised that there is no gain while they expected some.  I
would rather switch that to an on/off switch, which defaults to "off",
or simply what is available now.  huge_pages=try was a bad idea as the
result is not deterministic, so I would not have more of that...

Putting CloneFile and check_reflink in a location that other frontend
binaries could use would be nice, like pg_rewind.
--
Michael

Attachment

signature.asc

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Peter Eisentraut

Date:

17 July 2018, 13:32:57

On 17.07.18 08:58, Michael Paquier wrote:
> Hm...  I am wondering if we actually want the "auto" mode where we make
> the option smarter automatically.  I am afraid of users trying to use it
> and being surprised that there is no gain while they expected some.  I
> would rather switch that to an on/off switch, which defaults to "off",
> or simply what is available now.  huge_pages=try was a bad idea as the
> result is not deterministic, so I would not have more of that...

Why do you think that was a bad idea?  Doing the best possible job by
default without requiring explicit configuration in each case seems like
an attractive feature.

> Putting CloneFile and check_reflink in a location that other frontend
> binaries could use would be nice, like pg_rewind.

This could be done in subsequent patches, but the previous iteration of
this patch for CREATE DATABASE integration already showed that each of
those cases needs separate consideration.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Peter Eisentraut

Date:

01 September 2018, 08:28:37

rebased patch, no functionality changes

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

v4-0001-pg_upgrade-Allow-use-of-file-cloning.patch

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Michael Paquier

Date:

26 September 2018, 09:44:08

Hi Peter,

On Sat, Sep 01, 2018 at 07:28:37AM +0200, Peter Eisentraut wrote:
> rebased patch, no functionality changes

Could you rebase once again?  I am going through the patch and wanted to
test pg_upgrade on Linux with XFS, but it does not apply anymore.

Thanks,
--
Michael

Attachment

signature.asc

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Peter Eisentraut

Date:

28 September 2018, 00:10:08

On 26/09/2018 08:44, Michael Paquier wrote:
> On Sat, Sep 01, 2018 at 07:28:37AM +0200, Peter Eisentraut wrote:
>> rebased patch, no functionality changes
> 
> Could you rebase once again?  I am going through the patch and wanted to
> test pg_upgrade on Linux with XFS, but it does not apply anymore.

attached

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

v5-0001-pg_upgrade-Allow-use-of-file-cloning.patch

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Michael Paquier

Date:

28 September 2018, 08:19:53

On Thu, Sep 27, 2018 at 11:10:08PM +0200, Peter Eisentraut wrote:
> On 26/09/2018 08:44, Michael Paquier wrote:
>> Could you rebase once again?  I am going through the patch and wanted to
>> test pg_upgrade on Linux with XFS, but it does not apply anymore.
>
> attached

Thanks for the rebase.  At the end I got my hands on only an APFS using
a mac.  I ran a test with an instance holding a database with pgbench at
scaling factor 500, which gives close to 6.5GB.  The results are nice on
my laptop:
- --reflink=never runs in 15s
- --reflink=always runs in 4s
So that's a very nice gain!

+    static bool     cloning_ok = true;
+
+    pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n",
+           old_file, new_file);
+    if (cloning_ok &&
+        !cloneFile(old_file, new_file, map->nspname, map->relname, true))
+    {
+        pg_log(PG_VERBOSE, "cloning not supported, switching to copying\n");
+        cloning_ok = false;
+        copyFile(old_file, new_file, map->nspname, map->relname);
+    }
+    else
+        copyFile(old_file, new_file, map->nspname, map->relname);

This part overlaps with the job that check_reflink() already does.
Wouldn't it be more simple to have check_reflink do a one-time check
with the automatic mode, enforcing to REFLINK_NEVER if cloning test did
not pass when REFLINK_AUTO is used?  This would simplify
transfer_relfile().

The --help output of pg_upgrade has not been updated.

I am not a fan of the --reflink interface to be honest, even if this
maps to what cp offers, particularly because there is already the --link
mode, and that the new option interacts with it.  Here is an idea of
interface with an option named, named say, --transfer-method:
- "link", maps to the existing --link, which is just kept as a
deprecated alias.
- "clone" is the new mode you propose.
- "copy" is the default, and copies directly files.  This automatic mode
also makes the implementation around transfer_relfile more difficult to
apprehend in my opinion, and I would think that all the different
transfer modes ought to be maintained within it.  pg_upgrade.h also has
logic for such transfer modes.

After that, the implementation of cloneFile() looks logically correct to
me.
--
Michael

Attachment

signature.asc

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Michael Paquier

Date:

01 October 2018, 05:23:19

On Fri, Sep 28, 2018 at 02:19:53PM +0900, Michael Paquier wrote:
> Thanks for the rebase.  At the end I got my hands on only an APFS using
> a mac.  I ran a test with an instance holding a database with pgbench at
> scaling factor 500, which gives close to 6.5GB.  The results are nice on
> my laptop:
> - --reflink=never runs in 15s
> - --reflink=always runs in 4s
> So that's a very nice gain!

Please note that I have moved the patch to CF 2018-11, as the last
review is very recent.
--
Michael

Attachment

signature.asc

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Peter Eisentraut

Date:

02 October 2018, 15:31:35

On 28/09/2018 07:19, Michael Paquier wrote:
> +    static bool     cloning_ok = true;
> +
> +    pg_log(PG_VERBOSE, "copying \"%s\" to \"%s\"\n",
> +           old_file, new_file);
> +    if (cloning_ok &&
> +        !cloneFile(old_file, new_file, map->nspname, map->relname, true))
> +    {
> +        pg_log(PG_VERBOSE, "cloning not supported, switching to copying\n");
> +        cloning_ok = false;
> +        copyFile(old_file, new_file, map->nspname, map->relname);
> +    }
> +    else
> +        copyFile(old_file, new_file, map->nspname, map->relname);
> 
> This part overlaps with the job that check_reflink() already does.
> Wouldn't it be more simple to have check_reflink do a one-time check
> with the automatic mode, enforcing to REFLINK_NEVER if cloning test did
> not pass when REFLINK_AUTO is used?  This would simplify
> transfer_relfile().

I'll look into that.

> The --help output of pg_upgrade has not been updated.

will fix

> I am not a fan of the --reflink interface to be honest, even if this
> maps to what cp offers, particularly because there is already the --link
> mode, and that the new option interacts with it.  Here is an idea of
> interface with an option named, named say, --transfer-method:
> - "link", maps to the existing --link, which is just kept as a
> deprecated alias.
> - "clone" is the new mode you propose.
> - "copy" is the default, and copies directly files.  This automatic mode
> also makes the implementation around transfer_relfile more difficult to
> apprehend in my opinion, and I would think that all the different
> transfer modes ought to be maintained within it.  pg_upgrade.h also has
> logic for such transfer modes.

I can see the argument for that.  But I don't understand where the
automatic mode fits into this.  I would like to keep all three modes
from my patch: copy, clone-if-possible, clone-or-fail, unless you want
to argue against that.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Michael Paquier

Date:

03 October 2018, 02:32:48

On Tue, Oct 02, 2018 at 02:31:35PM +0200, Peter Eisentraut wrote:
> I can see the argument for that.  But I don't understand where the
> automatic mode fits into this.  I would like to keep all three modes
> from my patch: copy, clone-if-possible, clone-or-fail, unless you want
> to argue against that.

I'd like to argue against that :)

There could be an argument for having an automatic more within this
scheme, still I am not really a fan of this.  When somebody integrates
pg_upgrade within an upgrade framework, they would likely test if
cloning actually works, bumping immediately on a failure, no?  I'd like
to think that copy should be the default, cloning being available as an
option.  Cloning is not supported on many filesystems anyway..
--
Michael

Attachment

signature.asc

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Alvaro Herrera

Date:

03 October 2018, 04:35:02

On 2018-Oct-03, Michael Paquier wrote:

> There could be an argument for having an automatic more within this
> scheme, still I am not really a fan of this.  When somebody integrates
> pg_upgrade within an upgrade framework, they would likely test if
> cloning actually works, bumping immediately on a failure, no?  I'd like
> to think that copy should be the default, cloning being available as an
> option.  Cloning is not supported on many filesystems anyway..

I'm not clear what interface are you proposing.  Maybe they would just
use the clone-or-fail mode, and note whether it fails?  If that's not
it, please explain.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Michael Paquier

Date:

03 October 2018, 04:44:09

On Tue, Oct 02, 2018 at 10:35:02PM -0300, Alvaro Herrera wrote:
> I'm not clear what interface are you proposing.  Maybe they would just
> use the clone-or-fail mode, and note whether it fails?  If that's not
> it, please explain.

Okay.  What I am proposing is to not have any kind of automatic mode to
keep the code simple, with a new option called --transfer-mode, able to
do three things:
- "link", which is the equivalent of the existing --link.
- "copy", the default and the current behavior, which copies files.
- "clone", the new mode proposed in this thread.
--
Michael

Attachment

signature.asc

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Alvaro Herrera

Date:

03 October 2018, 05:07:06

On 2018-Oct-03, Michael Paquier wrote:

> On Tue, Oct 02, 2018 at 10:35:02PM -0300, Alvaro Herrera wrote:
> > I'm not clear what interface are you proposing.  Maybe they would just
> > use the clone-or-fail mode, and note whether it fails?  If that's not
> > it, please explain.
> 
> Okay.  What I am proposing is to not have any kind of automatic mode to
> keep the code simple, with a new option called --transfer-mode, able to
> do three things:
> - "link", which is the equivalent of the existing --link.
> - "copy", the default and the current behavior, which copies files.
> - "clone", the new mode proposed in this thread.

I see.  Peter is proposing to have a fourth mode, essentially
--transfer-mode=clone-or-copy.

Thinking about a generic tool that wraps pg_upgrade (say, Debian's
wrapper for it) this makes sense: just use the fastest non-destructive
mode available.  Which ones are available depends on what the underlying
filesystem is, so it's not up to the tool's writer to decide which to
use ahead of time.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Michael Paquier

Date:

03 October 2018, 10:20:03

On Tue, Oct 02, 2018 at 11:07:06PM -0300, Alvaro Herrera wrote:
> I see.  Peter is proposing to have a fourth mode, essentially
> --transfer-mode=clone-or-copy.

Yes, a mode which depends on what the file system supports.  Perhaps
"safe" or "fast" could be another name, in the shape of the fastest
method available which does not destroy the existing cluster's data.

> Thinking about a generic tool that wraps pg_upgrade (say, Debian's
> wrapper for it) this makes sense: just use the fastest non-destructive
> mode available.  Which ones are available depends on what the underlying
> filesystem is, so it's not up to the tool's writer to decide which to
> use ahead of time.

This could have merit.  Now, it seems to me that we have two separate
concepts here, which should be addressed separately:
1) cloning file mode, which is the new actual feature.
2) automatic mode, which is a subset of the copy mode and the new clone
mode.  What I am not sure when it comes to that stuff is if we should
consider in some way the link mode as being part of this automatic
selection concept..
--
Michael

Attachment

signature.asc

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Bruce Momjian

Date:

10 October 2018, 19:50:02

On Tue, Oct  2, 2018 at 11:07:06PM -0300, Alvaro Herrera wrote:
> On 2018-Oct-03, Michael Paquier wrote:
> 
> > On Tue, Oct 02, 2018 at 10:35:02PM -0300, Alvaro Herrera wrote:
> > > I'm not clear what interface are you proposing.  Maybe they would just
> > > use the clone-or-fail mode, and note whether it fails?  If that's not
> > > it, please explain.
> > 
> > Okay.  What I am proposing is to not have any kind of automatic mode to
> > keep the code simple, with a new option called --transfer-mode, able to
> > do three things:
> > - "link", which is the equivalent of the existing --link.
> > - "copy", the default and the current behavior, which copies files.
> > - "clone", the new mode proposed in this thread.
> 
> I see.  Peter is proposing to have a fourth mode, essentially
> --transfer-mode=clone-or-copy.

Uh, if you use --link, and the two data directories are on different
file systems, it fails.  No one has ever asked for link-or-copy, so why
are we considering clone-or-copy?  Are we going to need
link-or-clone-or-copy too?  I do realize that clone and copy have
non-destructive behavior on the old cluster once started, so it does
make some sense to merge them, unlike link.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Peter Eisentraut

Date:

11 October 2018, 14:50:50

On 10/10/2018 21:50, Bruce Momjian wrote:
>> I see.  Peter is proposing to have a fourth mode, essentially
>> --transfer-mode=clone-or-copy.
> 
> Uh, if you use --link, and the two data directories are on different
> file systems, it fails.  No one has ever asked for link-or-copy, so why
> are we considering clone-or-copy?  Are we going to need
> link-or-clone-or-copy too?  I do realize that clone and copy have
> non-destructive behavior on the old cluster once started, so it does
> make some sense to merge them, unlike link.

I'm OK to get rid of the clone-or-copy mode.  I can see the confusion.

The original reason for this behavior was that the Linux implementation
used copy_file_range(), which does clone-or-copy without any way to
control it.  But the latest patch doesn't use that anymore, so we don't
really need it, if it's controversial.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Peter Eisentraut

Date:

18 October 2018, 21:59:00

On 11/10/2018 16:50, Peter Eisentraut wrote:
> On 10/10/2018 21:50, Bruce Momjian wrote:
>>> I see.  Peter is proposing to have a fourth mode, essentially
>>> --transfer-mode=clone-or-copy.
>>
>> Uh, if you use --link, and the two data directories are on different
>> file systems, it fails.  No one has ever asked for link-or-copy, so why
>> are we considering clone-or-copy?  Are we going to need
>> link-or-clone-or-copy too?  I do realize that clone and copy have
>> non-destructive behavior on the old cluster once started, so it does
>> make some sense to merge them, unlike link.
> 
> I'm OK to get rid of the clone-or-copy mode.  I can see the confusion.

New patch that removes all the various reflink modes and adds a new
option --clone that works similar to --link.  I think it's much cleaner now.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

v6-0001-pg_upgrade-Allow-use-of-file-cloning.patch

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Michael Paquier

Date:

18 October 2018, 23:50:46

On Thu, Oct 18, 2018 at 11:59:00PM +0200, Peter Eisentraut wrote:
> New patch that removes all the various reflink modes and adds a new
> option --clone that works similar to --link.  I think it's much cleaner now.

Thanks Peter for the new version.

+
+        {"clone", no_argument, NULL, 1},
+
         {NULL, 0, NULL, 0}

Why newlines here?

@@ -293,6 +300,7 @@ usage(void)
     printf(_("  -U, --username=NAME           cluster superuser (default \"%s\")\n"), os_info.user);
     printf(_("  -v, --verbose                 enable verbose internal logging\n"));
     printf(_("  -V, --version                 display version information, then exit\n"));
+    printf(_("  --clone                       clone instead of copying files to new cluster\n"));
     printf(_("  -?, --help                    show this help, then exit\n"));
     printf(_("\n"

An idea for a one-letter option could be -n.  This patch can live
without.

+     pg_fatal("error while cloning relation \"%s.%s\": could not open file \"%s\": %s\n",
+              schemaName, relName, src, strerror(errno));

The tail of those error messages "could not open file" and "could not
create file" are already available as translatable error messages.
Would it be better to split those messages in two for translators?  One
is generated with pg_fatal("error while cloning relation \"%s.%s\":
could not open file \"%s\": %s\n",
+              schemaName, relName, src, strerror(errno));

The tail of those error messages "could not open file" and "could not
create file" are already available as translatable error messages.
Would it be better to split those messages in two for translators?  One
is generated with PG_VERBOSE to mention that cloning checks are
done, and the second one is fatal with the actual errors.

Those are all minor points, the patch looks good to me.
--
Michael

Attachment

signature.asc

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Peter Eisentraut

Date:

07 November 2018, 17:42:00

On 19/10/2018 01:50, Michael Paquier wrote:
> On Thu, Oct 18, 2018 at 11:59:00PM +0200, Peter Eisentraut wrote:
>> New patch that removes all the various reflink modes and adds a new
>> option --clone that works similar to --link.  I think it's much cleaner now.
> 
> Thanks Peter for the new version.
> 
> +
> +        {"clone", no_argument, NULL, 1},
> +
>          {NULL, 0, NULL, 0}
> 
> Why newlines here?

fixed

> @@ -293,6 +300,7 @@ usage(void)
>      printf(_("  -U, --username=NAME           cluster superuser (default \"%s\")\n"), os_info.user);
>      printf(_("  -v, --verbose                 enable verbose internal logging\n"));
>      printf(_("  -V, --version                 display version information, then exit\n"));
> +    printf(_("  --clone                       clone instead of copying files to new cluster\n"));
>      printf(_("  -?, --help                    show this help, then exit\n"));
>      printf(_("\n"
> 
> An idea for a one-letter option could be -n.  This patch can live
> without.

-n is often used for something like "dry run", so it didn't go for that
here.  I suspect the cloning will remain a marginal option for some
time, so having only a long option is acceptable.

> +     pg_fatal("error while cloning relation \"%s.%s\": could not open file \"%s\": %s\n",
> +              schemaName, relName, src, strerror(errno));
> 
> The tail of those error messages "could not open file" and "could not
> create file" are already available as translatable error messages.
> Would it be better to split those messages in two for translators?  One
> is generated with pg_fatal("error while cloning relation \"%s.%s\":
> could not open file \"%s\": %s\n",
> +              schemaName, relName, src, strerror(errno));

I think this is too complicated for a few messages.

> Those are all minor points, the patch looks good to me.

Committed, thanks.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Tom Lane

Date:

07 November 2018, 18:03:15

Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> Committed, thanks.

It seems unfortunate that there is no test coverage in the committed
patch.  I realize that it may be hard to test given the filesystem
dependency, but how will we know if it works at all?

            regards, tom lane

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Michael Paquier

Date:

08 November 2018, 04:56:23

On Wed, Nov 07, 2018 at 06:42:00PM +0100, Peter Eisentraut wrote:
> -n is often used for something like "dry run", so it didn't go for that
> here.  I suspect the cloning will remain a marginal option for some
> time, so having only a long option is acceptable.

Good point.  I didn't consider this, and it is true that --dry-run is
used, pg_rewind being one example.
--
Michael

Attachment

signature.asc

Re: file cloning in pg_upgrade and CREATE DATABASE

From

Peter Eisentraut

Date:

08 November 2018, 07:27:46

On 07/11/2018 19:03, Tom Lane wrote:
> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>> Committed, thanks.
> 
> It seems unfortunate that there is no test coverage in the committed
> patch.  I realize that it may be hard to test given the filesystem
> dependency, but how will we know if it works at all?

You can use something like

PG_UPGRADE_OPTS=--clone make -C src/bin/pg_upgrade check

(--link is similarly untested.)

If we do get the TAP tests for pg_upgrade set up, we can create smaller
and faster test cases for this.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services