Re: Btrfs clone WIP patch - Mailing list pgsql-hackers

From Jonathan Rogers
Subject Re: Btrfs clone WIP patch
Date
Msg-id 513C0E7C.5080606@socialserve.com
Whole thread Raw
In response to Re: Btrfs clone WIP patch  (Greg Smith <greg@2ndQuadrant.com>)
Responses Re: Btrfs clone WIP patch  (Greg Smith <greg@2ndQuadrant.com>)
List pgsql-hackers
Greg Smith wrote:
> I think I can see how to construct such an example for the btrfs
> version, but having you show that explicitly (preferably with a whole
> sample session executing it) will also help reviewers.  Remember:  if
> you want to get your submission off to a good start, the reviewer should
> be able to run your sample test, see the code work, and do something fun
> within a few seconds of compiling it.  Make that easy for them, and your
> reviewer will start with a good impression of you and a positive outlook
> for the change.

Yes, an example is very simple with Btrfs, since it only requires one
GUC variable and that the cluster be created on a Btrfs file system.
Constructing a ZFS is decidedly non-trivial and I'm starting to question
if it's worth it.

> 
> Now onto the code nitpicking!
> 
> = Extension vs. GUC =
> 
> In addition to not polluting the postgresql.conf.sample, there's another
> reason this might make for better extension material eventually.  Adding
> new execv calls that are building strings like this is just generally a
> risky thing.  It would be nice from a security perspective if that
> entire mechanism wasn't even introduced into the server at all unless
> someone loaded the extension.
> 
> An extension implementation will end up being more code, both to add a
> useful hook for replace these calls and for the extension packaging
> itself.  Having a bit more code in contrib/ but less in src &
> postgresql.conf is probably a net positive trade though.

I will look into how to write an extension.

> 
> = Diff reduction / code refactoring =
> 
> Looks like you added a "File Operation Options" entry into guc.c but
> then not use it here?  I would just keep this in an existing category
> for now, try to reduce the diff length of the proof of concept version
> as much as possible in the beginning.

Yes, that is a good idea.

> 
> On the topic of smaller diffs, the similar cut and paste sections of the
> two entries that both do fork/exec/waitpid should really be refactored
> into one function.  The archiver does something similar for running
> archive_command, there may be some reuse or refactoring of its code to
> present this interface.

I'll look at archive_command to see what might be in common.

> 
> Again, this sort of refactoring is not necessary as a POC patch.  But it
> will probably come up if this moves toward commit candidate.
> 
> = Recursion and directory navigation =
> 
>> In either case, the directories are copied recursively while the
>> Postgres internal copydir function does not recurse. I don't think that
>> should be a problem since there shouldn't be nested directories in the
>> first place.
> 
> copydir takes an option for whether it should recurse or not.
> 
> The rm side of makes me twitch for a number of reasons.  First off,
> there's just the general scariness of the concept of shelling out to run
> rm recursively with some text string you build.  The worst time I saw a
> bug in that sort of code destroyed a terabyte, and the last time I saw
> such a bug was only a week ago.  Validation you're doing the right thing
> is always job #1 in removing files.

I needed to call an external command to remove a directory only when
experimenting on ZFS since the regular implementation works fine on
Btrfs. Unlike Btrfs, ZFS does not have any capability to clone
individual files. Therefore, a directory under base/ has to start out as
a ZFS snapshot, which can be cloned with the "zfs clone" command to
become a snapshot directory with exactly the same contents. To remove
the clone, the "zfs destroy" command has to be called on the clone.
AFAICT, clone and remove operations on ZFS always operate on a whole
directory at a time.

> 
> The same sort of issue is in your external_copydir.  Iterating into
> subdirectories when it doesn't happen now just isn't safe, even though
> the one expected case you're hooking won't be any different.  You really
> can't just do that.  Would this work instead, and is there any concern
> about files that start with a "."?
> 
> "cp * --reflink=auto"
> 
> Regardless, you need to keep most of the structure to copydir anyway.
> Error handling, handling cancellation, and fsync calls are all vital
> things.  You probably have to make the forked command copy a single file
> at a time to get the same interrupt handling behavior.
> 

In an earlier implementation, I did call "cp --reflink=auto" once per
regular file, preserving the behavior of copydir. On Btrfs, this works
well, though slightly slower due to extra processes. AFAIK, there's no
way to do something equivalent on ZFS without coming up with a much more
complicated scheme involving both links and clones.

I don't think it will be possible to implement a scheme that works on
ZFS and addresses your concerns about file and directory handling that
is not many times more complex than what I have so far. OTOH, I think
the approach I have already implemented which calls an external command
for each regular file to copy might be acceptable. Since I don't
personally have much interest in using ZFS, I think I'm going to just
focus on Btrfs for now.

-- 
Jonathan Ross Rogers



pgsql-hackers by date:

Previous
From: Dann Corbit
Date:
Subject: Re: Why do we still perform a check for pre-sorted input within qsort variants?
Next
From: Michael Paquier
Date:
Subject: Re: Support for REINDEX CONCURRENTLY