Re: should crash recovery ignore checkpoint_flush_after ? - Mailing list pgsql-hackers

From Andres Freund
Subject Re: should crash recovery ignore checkpoint_flush_after ?
Msg-id 20200118233202.ax27prmsvvxqaytx@alap3.anarazel.de
In response to Re: should crash recovery ignore checkpoint_flush_after ?  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: should crash recovery ignore checkpoint_flush_after ?
List pgsql-hackers
Hi,

On 2020-01-19 09:52:21 +1300, Thomas Munro wrote:
> On Sun, Jan 19, 2020 at 3:08 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > As I understand, the first thing that happens is syncing every file in the data
> > dir, like in initdb --sync.  These instances were both 5+TB on zfs, with
> > compression, so that's slow but tolerable, at least understandable, and
> > with visible progress in ps.
> >
> > The 2nd stage replays WAL.  strace shows it's occasionally running
> > sync_file_range, and I think recovery might've been several times faster if
> > we'd just dumped the data to the OS ASAP and fsynced once per file.  In fact, I
> > just kill -9'd the recovery process and edited the config to disable this lest it
> > spend all night in recovery.
> 
> Does sync_file_range() even do anything for non-mmap'd files on ZFS?

Good point. Next time it might be worthwhile to use strace -T to see
whether the sync_file_range calls actually take meaningful time.


> Non-mmap'd ZFS data is not in the Linux page cache, and I think
> sync_file_range() works at that level.  At a guess, there'd need to be
> a new VFS file_operation so that ZFS could get a callback to handle
> data in its ARC.

Yea, it requires the pages to be in the pagecache to do anything:

int sync_file_range(struct file *file, loff_t offset, loff_t nbytes,
            unsigned int flags)
{
...

    if (flags & SYNC_FILE_RANGE_WRITE) {
        int sync_mode = WB_SYNC_NONE;

        if ((flags & SYNC_FILE_RANGE_WRITE_AND_WAIT) ==
                 SYNC_FILE_RANGE_WRITE_AND_WAIT)
            sync_mode = WB_SYNC_ALL;

        ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
                         sync_mode);
        if (ret < 0)
            goto out;
    }

and then

int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
                loff_t end, int sync_mode)
{
    int ret;
    struct writeback_control wbc = {
        .sync_mode = sync_mode,
        .nr_to_write = LONG_MAX,
        .range_start = start,
        .range_end = end,
    };

    if (!mapping_cap_writeback_dirty(mapping) ||
        !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
        return 0;

which means that if there are no pages in the pagecache for the relevant
range, it just finishes here.  *Iff* there are some, say because
something else mmap()ed a section, it would potentially call into the
address_space->writepages() callback.  So it's possible to emulate
enough state for ZFS or such to still get sync_file_range() to call into
it (by setting up a pseudo mapping tagged as dirty), but it's not really
the normal path.

Greetings,

Andres Freund


