Re: refactoring relation extension and BufferAlloc(), faster COPY - Mailing list pgsql-hackers

From Andres Freund
Subject Re: refactoring relation extension and BufferAlloc(), faster COPY
Date
Msg-id 20230329041355.5k6yywi7uer5v5sa@awork3.anarazel.de
Whole thread Raw
In response to Re: refactoring relation extension and BufferAlloc(), faster COPY  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List pgsql-hackers
Hi,

On 2023-03-27 15:32:47 +0900, Kyotaro Horiguchi wrote:
> At Sun, 26 Mar 2023 12:26:59 -0700, Andres Freund <andres@anarazel.de> wrote in
> > Hi,
> >
> > Attached is v5. Lots of comment polishing, a bit of renaming. I extracted the
> > relation extension related code in hio.c back into its own function.
> >
> > While reviewing the hio.c code, I did realize that too much stuff is done
> > while holding the buffer lock. See also the pre-existing issue
> > https://postgr.es/m/20230325025740.wzvchp2kromw4zqz%40awork3.anarazel.de
>
> 0001, 0002 looks fine to me.
>
> 0003 adds the new function FileFallocte, but we already have
> AllocateFile. Although fd.c contains functions with varying word
> orders, it could be confusing that closely named functions have
> different naming conventions.

The syscall is named fallocate, I don't think we'd gain anything by inventing
a different name for it? Given that there's a number of File$syscall
operations, I think it's clear enough that it just fits into that. Unless you
have a better proposal?


> +    /*
> +     * Return in cases of a "real" failure, if fallocate is not supported,
> +     * fall through to the FileZero() backed implementation.
> +     */
> +    if (returnCode != EINVAL && returnCode != EOPNOTSUPP)
> +        return returnCode;
>
> I'm not entirely sure, but man 2 fallocate tells that ENOSYS also can
> be returned. Some googling indicate that ENOSYS might need the same
> amendment to EOPNOTSUPP. However, I'm not clear on why man
> posix_fallocate donsn't mention the former.

posix_fallocate() and its errors are specified by posix, I guess. I think
glibc etc will map ENOSYS to EOPNOTSUPP.

I really dislike this bit from the posix_fallocate manpage:

EINVAL offset was less than 0, or len was less than or equal to 0, or the underlying filesystem does not support the
operation.

Why oh why would you add the "or .." portion into EINVAL, when there's also
EOPNOTSUPP?

>
> +        (returnCode != EINVAL && returnCode != EINVAL))
> :)
>
>
> FileGetRawDesc(File file)
>  {
>      Assert(FileIsValid(file));
> +
> +    if (FileAccess(file) < 0)
> +        return -1;
> +
>
> The function's comment is provided below.
>
> > * The returned file descriptor will be valid until the file is closed, but
> > * there are a lot of things that can make that happen.  So the caller should
> > * be careful not to do much of anything else before it finishes using the
> > * returned file descriptor.
>
> So, the responsibility to make sure the file is valid seems to lie
> with the callers, although I'm not sure since there aren't any
> function users in the tree.

Except, as I think you realized as well, external callers *can't* call
FileAccess(), it's static.


> I'm unclear as to why FileSize omits the case lruLessRecently != file.

Not quite following - why would FileSize() deal with lruLessRecently itself?
Or do you mean why FileSize() uses FileIsNotOpen() itself, rather than relying
on FileAccess() doing that internally?


> When examining similar functions, such as FileGetRawFlags and
> FileGetRawMode, I'm puzzled to find that FileAccess() nor
> BasicOpenFilePermthe don't set the struct members referred to by the
> functions.

Those aren't involved with LRU mechanism, IIRC. Note that BasicOpenFilePerm()
returns an actual fd, not a File. So you can't call FileGetRawMode() on it. As
BasicOpenFilePerm() says:
 * This is exported for use by places that really want a plain kernel FD,
 * but need to be proof against running out of FDs. ...


I don't think FileAccess() needs to set those struct members, that's already
been done in PathNameOpenFilePerm().


> This makes my question the usefulness of these functions including
> FileGetRawDesc().

It's quite weird that we have FileGetRawDesc(), but don't allow to use it in a
safe way...


> Regardless, since the
> patchset doesn't use FileGetRawDesc(), I don't believe the fix is
> necessary in this patch set.

Yea. It was used in an earlier version, but not anymore.



>
> +    if ((uint64) blocknum + nblocks >= (uint64) InvalidBlockNumber)
>
> I'm not sure it is appropriate to assume InvalidBlockNumber equals
> MaxBlockNumber + 1 in this context.

Hm. This is just the multi-block equivalent of what mdextend() already
does. It's not pretty, indeed. I'm not sure there's really a better thing to
do for mdzeroextend(), given the mdextend() precedent? mdzeroextend() (just as
mdextend()) will be called with blockNum == InvalidBlockNumber, if you try to
extend past the size limit.


>
> +         * However, we don't use FileAllocate() for small extensions, as it
> +         * defeats delayed allocation on some filesystems. Not clear where
> +         * that decision should be made though? For now just use a cutoff of
> +         * 8, anything between 4 and 8 worked OK in some local testing.
>
> The chose is quite similar to what FileFallocate() makes. However, I'm
> not sure FileFallocate() itself should be doing this.

I'm not following - there's no such choice in FileFallocate()? Do you mean
that FileFallocate() also falls back to FileZero()? I don't think those are
comparable.

We don't want the code outside of fd.c to have to implement a fallback for
platforms that don't have fallocate (or something similar), that's why
FileFallocate() falls back to FileZero().

Here we care about not calling FileFallocate() in too small increments, when
the relation might extend further. If we somehow knew in mdzeroextend() that
the file won't be extended further, it'd be a good idea to call
FileFallocate(), even if just for a single block - it'd prevent the kernel
from wasting memory for delayed allocation.  But unfortunately we don't know
if it's the final size, hence the heuristic.

Does that make sense?


Thanks!

Andres



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: refactoring relation extension and BufferAlloc(), faster COPY
Next
From: Amit Kapila
Date:
Subject: Re: Data is copied twice when specifying both child and parent table in publication