Hi,
On 2020-01-19 09:52:21 +1300, Thomas Munro wrote:
> On Sun, Jan 19, 2020 at 3:08 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > As I understand, the first thing that happens syncing every file in the data
> > dir, like in initdb --sync. These instances were both 5+TB on zfs, with
> > compression, so that's slow, but tolerable, and at least understandable, and
> > with visible progress in ps.
> >
> > The 2nd stage replays WAL. strace show's it's occasionally running
> > sync_file_range, and I think recovery might've been several times faster if
> > we'd just dumped the data at the OS ASAP, fsync once per file. In fact, I've
> > just kill -9 the recovery process and edited the config to disable this lest it
> > spend all night in recovery.
>
> Does sync_file_range() even do anything for non-mmap'd files on ZFS?
Good point. Next time it might be worthwhile to use strace -T to see
whether the sync_file_range calls actually take meaningful time.
> Non-mmap'd ZFS data is not in the Linux page cache, and I think
> sync_file_range() works at that level. At a guess, there'd need to be
> a new VFS file_operation so that ZFS could get a callback to handle
> data in its ARC.
Yea, it requires the pages to be in the pagecache to do anything:
int sync_file_range(struct file *file, loff_t offset, loff_t nbytes,
unsigned int flags)
{
...
if (flags & SYNC_FILE_RANGE_WRITE) {
int sync_mode = WB_SYNC_NONE;
if ((flags & SYNC_FILE_RANGE_WRITE_AND_WAIT) ==
SYNC_FILE_RANGE_WRITE_AND_WAIT)
sync_mode = WB_SYNC_ALL;
ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
sync_mode);
if (ret < 0)
goto out;
}
and then
int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
loff_t end, int sync_mode)
{
int ret;
struct writeback_control wbc = {
.sync_mode = sync_mode,
.nr_to_write = LONG_MAX,
.range_start = start,
.range_end = end,
};
if (!mapping_cap_writeback_dirty(mapping) ||
!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
return 0;
which means that if there's no pages in the pagecache for the relevant
range, it'll just finish here. *Iff* there are some, say because
something else mmap()ed a section, it'd potentially call into
address_space->writepages() callback. So it's possible to emulate
enough state for ZFS or such to still get sync_file_range() call into it
(by setting up a pseudo map tagged as dirty), but it's not really the
normal path.
Greetings,
Andres Freund