Thread: mdclose() does not cope w/ FileClose() failure

mdclose() does not cope w/ FileClose() failure

From: Noah Misch
Date: Sun, 22 Dec 2019 01:19:30 -0800
Forking thread "WAL logging problem in 9.4.3?" for this tangent:

On Mon, Dec 09, 2019 at 06:04:06PM +0900, Kyotaro Horiguchi wrote:
> I don't understand why mdclose checks for (v->mdfd_vfd >= 0) of open
> segment but anyway mdimmedsync is believing that that won't happen and
> I follow the assumption.  (I suspect that the if condition in mdclose
> should be an assertion..)

That check helps when data_sync_retry=on and FileClose() raised an error in a
previous mdclose() invocation.  However, the check is not sufficient to make
that case work; the attached test case (not for commit) gets an assertion
failure or SIGSEGV.

I am inclined to fix this by decrementing md_num_open_segs before modifying
md_seg_fds (second attachment).  An alternative would be to call
_fdvec_resize() after every FileClose(), like mdtruncate() does; however, the
repalloc() overhead could be noticeable.  (mdclose() is called much more
frequently than mdtruncate().)


Incidentally, _mdfd_openseg() has this:

    if (segno <= reln->md_num_open_segs[forknum])
        _fdvec_resize(reln, forknum, segno + 1);

That should be >=, not <=.  If the less-than case happened, this would delete
the record of a vfd for a higher-numbered segno.  There's no live bug, because
only segno == reln->md_num_open_segs[forknum] actually happens.  I am inclined
to turn that into an assertion and remove the condition:

    Assert(segno == reln->md_num_open_segs[forknum]);
    _fdvec_resize(reln, forknum, segno + 1);


Re: mdclose() does not cope w/ FileClose() failure

From: Noah Misch
Date: Sun, 22 Dec 2019 12:21:00 -0800
On Sun, Dec 22, 2019 at 01:19:30AM -0800, Noah Misch wrote:
> I am inclined to fix this by decrementing md_num_open_segs before modifying
> md_seg_fds (second attachment).

That leaked memory, since _fdvec_resize() assumes md_num_open_segs is also the
allocated array length.  The alternative is looking better:

> An alternative would be to call
> _fdvec_resize() after every FileClose(), like mdtruncate() does; however, the
> repalloc() overhead could be noticeable.  (mdclose() is called much more
> frequently than mdtruncate().)

I can skip repalloc() when the array length decreases, to assuage mdclose()'s
worry.  In the mdclose() case, the final _fdvec_resize(reln, fork, 0) will
still pfree() the array.  Array elements that mdtruncate() frees today will
instead persist to end of transaction.  That is okay, since mdtruncate()
crossing more than one segment boundary is fairly infrequent.  For it to
happen, you must either create a >2G relation and then TRUNCATE it in the same
transaction, or VACUUM must find >1-2G of unused space at the end of the
relation.  I'm now inclined to do it that way, attached.
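For illustration, here is a minimal, self-contained sketch of that approach.  The names (FdVec, fdvec_resize, my_close) are hypothetical stand-ins for the real md.c structures and functions, and error handling is elided; the point is the shape of the fix: close from the highest segment down and shrink the bookkeeping after every close, so that a close failure leaves consistent state for a retry.

```c
#include <stdlib.h>

/* Hypothetical stand-in for md.c's per-fork state; not the real struct. */
typedef struct
{
    int        *fds;            /* stands in for md_seg_fds */
    int         nseg;           /* stands in for md_num_open_segs */
} FdVec;

/*
 * Like the proposed _fdvec_resize(): grow via realloc, but when shrinking
 * keep the old allocation, except that shrinking to zero frees it.
 */
static void
fdvec_resize(FdVec *v, int nseg)
{
    if (nseg == 0)
    {
        free(v->fds);
        v->fds = NULL;
    }
    else if (nseg > v->nseg)
        v->fds = realloc(v->fds, sizeof(int) * nseg);
    /* else: shrinking; leave the allocation alone */
    v->nseg = nseg;
}

/*
 * Like the fixed mdclose(): close from the highest segment down, resizing
 * after every close, so state stays consistent if a close "throws" (here,
 * simulated by returning -1 when a designated fd is hit).
 */
static int
my_close(FdVec *v, int fail_at)
{
    while (v->nseg > 0)
    {
        int         segno = v->nseg - 1;

        if (v->fds[segno] == fail_at)
            return -1;          /* simulate FileClose() raising an error */
        fdvec_resize(v, segno); /* forget this segment before moving on */
    }
    return 0;
}
```

After a mid-loop failure, nseg reflects exactly the segments not yet closed, so a later retry resumes where the first attempt stopped.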


Re: mdclose() does not cope w/ FileClose() failure

From: Thomas Munro
Date: Mon, 23 Dec 2019 09:33:29 +1300
On Sun, Dec 22, 2019 at 10:19 PM Noah Misch <noah@leadboat.com> wrote:
>         Assert(segno == reln->md_num_open_segs[forknum]);
>         _fdvec_resize(reln, forknum, segno + 1);

Oh yeah, I spotted that part too but didn't follow up.


https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BNBw%2BuSzxF1os-SO6gUuw%3DcqO5DAybk6KnHKzgGvxhxA%40mail.gmail.com



Re: mdclose() does not cope w/ FileClose() failure

From: Noah Misch
On Mon, Dec 23, 2019 at 09:33:29AM +1300, Thomas Munro wrote:
> On Sun, Dec 22, 2019 at 10:19 PM Noah Misch <noah@leadboat.com> wrote:
> >         Assert(segno == reln->md_num_open_segs[forknum]);
> >         _fdvec_resize(reln, forknum, segno + 1);
> 
> Oh yeah, I spotted that part too but didn't follow up.
>
> https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BNBw%2BuSzxF1os-SO6gUuw%3DcqO5DAybk6KnHKzgGvxhxA%40mail.gmail.com

That patch of yours looks good.



Re: mdclose() does not cope w/ FileClose() failure

From: Kyotaro Horiguchi
Date: Mon, 23 Dec 2019 19:41:49 +0900
Hello.

At Sun, 22 Dec 2019 12:21:00 -0800, Noah Misch <noah@leadboat.com> wrote in 
> On Sun, Dec 22, 2019 at 01:19:30AM -0800, Noah Misch wrote:
> > I am inclined to fix this by decrementing md_num_open_segs before modifying
> > md_seg_fds (second attachment).
> 
> That leaked memory, since _fdvec_resize() assumes md_num_open_segs is also the
> allocated array length.  The alternative is looking better:

I agree that v2 is cleaner from a modularity standpoint and fixes the
memory leak that happens at re-open.

> > An alternative would be to call
> > _fdvec_resize() after every FileClose(), like mdtruncate() does; however, the
> > repalloc() overhead could be noticeable.  (mdclose() is called much more
> > frequently than mdtruncate().)
> 
> I can skip repalloc() when the array length decreases, to assuage mdclose()'s
> worry.  In the mdclose() case, the final _fdvec_resize(reln, fork, 0) will
> still pfree() the array.  Array elements that mdtruncate() frees today will
> instead persist to end of transaction.  That is okay, since mdtruncate()
> crossing more than one segment boundary is fairly infrequent.  For it to
> happen, you must either create a >2G relation and then TRUNCATE it in the same
> transaction, or VACUUM must find >1-2G of unused space at the end of the
> relation.  I'm now inclined to do it that way, attached.

         * It doesn't seem worthwhile complicating the code by having a more
         * aggressive growth strategy here; the number of segments doesn't
         * grow that fast, and the memory context internally will sometimes
-         * avoid doing an actual reallocation.
+         * avoid doing an actual reallocation.  Likewise, since the number of
+         * segments doesn't shrink that fast, don't shrink at all.  During
+         * mdclose(), we'll pfree the array at nseg==0.

If I understand it correctly, that comment is talking about the total
number of segment files in a fork, not the length of the md_seg_fds
array at a given moment. But _fdvec_resize is actually called for every
segment opened during mdnblocks (just after mdopen), and for every
segment closed during mdclose and mdtruncate, as mentioned here. We are
going to omit repallocs only in the shrinking case.

If we regard repalloc as far faster than FileOpen/FileClose, or if we
care only about growth in the number of mdopen'ed segments and not about
the frequent resizing that happens in the functions above, then the
comment is right and we can resize the array segment by segment.

But if the costs are comparable, or if we don't want the array resized
frequently, we might need to prevent repalloc from happening on every
segment increase, too.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: mdclose() does not cope w/ FileClose() failure

From: Noah Misch
Date: Tue, 24 Dec 2019 11:57:39 -0800
On Mon, Dec 23, 2019 at 07:41:49PM +0900, Kyotaro Horiguchi wrote:
> At Sun, 22 Dec 2019 12:21:00 -0800, Noah Misch <noah@leadboat.com> wrote in 
> > On Sun, Dec 22, 2019 at 01:19:30AM -0800, Noah Misch wrote:
> > > An alternative would be to call
> > > _fdvec_resize() after every FileClose(), like mdtruncate() does; however, the
> > > repalloc() overhead could be noticeable.  (mdclose() is called much more
> > > frequently than mdtruncate().)
> > 
> > I can skip repalloc() when the array length decreases, to assuage mdclose()'s
> > worry.  In the mdclose() case, the final _fdvec_resize(reln, fork, 0) will
> > still pfree() the array.  Array elements that mdtruncate() frees today will
> > instead persist to end of transaction.  That is okay, since mdtruncate()
> > crossing more than one segment boundary is fairly infrequent.  For it to
> > happen, you must either create a >2G relation and then TRUNCATE it in the same
> > transaction, or VACUUM must find >1-2G of unused space at the end of the
> > relation.  I'm now inclined to do it that way, attached.
> 
>          * It doesn't seem worthwhile complicating the code by having a more
>          * aggressive growth strategy here; the number of segments doesn't
>          * grow that fast, and the memory context internally will sometimes
> -         * avoid doing an actual reallocation.
> +         * avoid doing an actual reallocation.  Likewise, since the number of
> +         * segments doesn't shrink that fast, don't shrink at all.  During
> +         * mdclose(), we'll pfree the array at nseg==0.
> 
> If I understand it correctly, that comment is talking about the total
> number of segment files in a fork, not the length of the md_seg_fds
> array at a given moment. But _fdvec_resize is actually called for every
> segment opened during mdnblocks (just after mdopen), and for every
> segment closed during mdclose and mdtruncate, as mentioned here. We are
> going to omit repallocs only in the shrinking case.

That is a good point.  How frequently one adds 1 GiB of data is not the main
issue.  mdclose() and subsequent re-opening of all segments will be more
relevant to overall performance.

> If we regard repalloc as far faster than FileOpen/FileClose, or if we
> care only about growth in the number of mdopen'ed segments and not about
> the frequent resizing that happens in the functions above, then the
> comment is right and we can resize the array segment by segment.

In most cases, the array will fit into a power-of-two chunk, so repalloc()
already does the right thing.  Once the table has more than ~1000 segments (~1
TiB table size), the allocation will get a single-chunk block, and every
subsequent repalloc() will call realloc().  Even then, repalloc() probably is
far faster than File operations.  Likely, I should just accept the extra
repalloc() calls and drop the "else if" change in _fdvec_resize().
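As a rough model of that chunk behavior (the sizing rule here is an assumption for illustration, not the actual aset.c code), one can round each request up to a power of two and count how often a grow step would actually move the allocation:

```c
#include <stddef.h>

/*
 * Rough model of aset.c's small-chunk sizing: round the request up to the
 * next power of two (assumed minimum of 8 bytes).  A repalloc() that stays
 * within the same rounded size needs no actual reallocation.  (Requests
 * beyond the chunk limit get their own block and always reallocate; that
 * regime is not modeled here.)
 */
static size_t
model_chunk_size(size_t request)
{
    size_t      size = 8;

    while (size < request)
        size <<= 1;
    return size;
}

/*
 * Count how many of n grow steps (8 bytes per step, as for an array of
 * 8-byte vfd entries) would cross a chunk boundary and actually move.
 */
static int
count_moves(int n)
{
    int         moves = 0;
    size_t      cur = model_chunk_size(8);

    for (int i = 2; i <= n; i++)
    {
        size_t      want = model_chunk_size((size_t) i * 8);

        if (want != cur)
        {
            moves++;
            cur = want;
        }
    }
    return moves;
}
```

Under this model, growing the array one entry at a time to 1000 entries moves the allocation only about 10 times; the other ~990 repalloc() calls are no-ops as far as the underlying memory is concerned.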



Re: mdclose() does not cope w/ FileClose() failure

From: Kyotaro Horiguchi
Date: Wed, 25 Dec 2019 10:39:32 +0900
At Tue, 24 Dec 2019 11:57:39 -0800, Noah Misch <noah@leadboat.com> wrote in 
> On Mon, Dec 23, 2019 at 07:41:49PM +0900, Kyotaro Horiguchi wrote:
> > If I understand it correctly, that comment is talking about the total
> > number of segment files in a fork, not the length of the md_seg_fds
> > array at a given moment. But _fdvec_resize is actually called for every
> > segment opened during mdnblocks (just after mdopen), and for every
> > segment closed during mdclose and mdtruncate, as mentioned here. We are
> > going to omit repallocs only in the shrinking case.
> 
> That is a good point.  How frequently one adds 1 GiB of data is not the main
> issue.  mdclose() and subsequent re-opening of all segments will be more
> relevant to overall performance.

Yes, that's exactly what I meant.

> > If we regard repalloc as far faster than FileOpen/FileClose, or if we
> > care only about growth in the number of mdopen'ed segments and not about
> > the frequent resizing that happens in the functions above, then the
> > comment is right and we can resize the array segment by segment.
> 
> In most cases, the array will fit into a power-of-two chunk, so repalloc()
> already does the right thing.  Once the table has more than ~1000 segments (~1
> TiB table size), the allocation will get a single-chunk block, and every
> subsequent repalloc() will call realloc().  Even then, repalloc() probably is
> far faster than File operations.  Likely, I should just accept the extra
> repalloc() calls and drop the "else if" change in _fdvec_resize().

I'm not sure which is better. If we accept that repalloc
(AllocSetRealloc) doesn't actually free memory on shrink, there's no
point in calling repalloc when shrinking, and we could omit it in the
name of optimization.  If we instead want to free memory as aggressively
as possible, we should call repalloc and trust it to do so.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: mdclose() does not cope w/ FileClose() failure

From: Noah Misch
Date: Wed, 1 Jan 2020 23:46:02 -0800
On Wed, Dec 25, 2019 at 10:39:32AM +0900, Kyotaro Horiguchi wrote:
> At Tue, 24 Dec 2019 11:57:39 -0800, Noah Misch <noah@leadboat.com> wrote in 
> > On Mon, Dec 23, 2019 at 07:41:49PM +0900, Kyotaro Horiguchi wrote:
> > > If we regard repalloc as far faster than FileOpen/FileClose, or if we
> > > care only about growth in the number of mdopen'ed segments and not about
> > > the frequent resizing that happens in the functions above, then the
> > > comment is right and we can resize the array segment by segment.
> > 
> > In most cases, the array will fit into a power-of-two chunk, so repalloc()
> > already does the right thing.  Once the table has more than ~1000 segments (~1
> > TiB table size), the allocation will get a single-chunk block, and every
> > subsequent repalloc() will call realloc().  Even then, repalloc() probably is
> > far faster than File operations.  Likely, I should just accept the extra
> > repalloc() calls and drop the "else if" change in _fdvec_resize().
> 
> I'm not sure which is better. If we accept that repalloc
> (AllocSetRealloc) doesn't actually free memory on shrink, there's no
> point in calling repalloc when shrinking, and we could omit it in the
> name of optimization.  If we instead want to free memory as aggressively
> as possible, we should call repalloc and trust it to do so.

As long as we free the memory by the end of mdclose(), I think it doesn't
matter whether we freed memory in the middle of mdclose().

I ran a crude benchmark that found PathNameOpenFile()+FileClose() costing at
least two hundred times as much as the repalloc() pair.  Hence, I now plan not
to avoid repalloc(), as attached.  Crude benchmark code:

    #define NSEG 9000
    for (i = 0; i < count1; i++)
    {
        int j;

        for (j = 0; j < NSEG; ++j)
        {
            File f = PathNameOpenFile("/etc/services", O_RDONLY);
            if (f < 0)
                elog(ERROR, "fail open: %m");
            FileClose(f);
        }
    }

    for (i = 0; i < count2; i++)
    {
        int j;
        void *buf = palloc(1);

        for (j = 2; j < NSEG; ++j)
            buf = repalloc(buf, j * 8);
        while (--j > 0)
            buf = repalloc(buf, j * 8);
    }
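A portable approximation of the same comparison, with POSIX open()/close() and realloc() as stand-ins for the backend's PathNameOpenFile()/FileClose() and repalloc() (the file path and loop bounds are arbitrary):

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define NSEG 9000

/*
 * One pass of the open/close side, with POSIX open()/close() standing in
 * for PathNameOpenFile()/FileClose().  Returns 0 on success, -1 if any
 * open fails.
 */
static int
bench_open_close(const char *path)
{
    for (int j = 0; j < NSEG; ++j)
    {
        int         fd = open(path, O_RDONLY);

        if (fd < 0)
            return -1;
        close(fd);
    }
    return 0;
}

/*
 * One pass of the realloc side: grow the array one step at a time, then
 * shrink it back the same way, mirroring the repalloc() pair in the
 * original benchmark.  Returns the final buffer (caller frees).
 */
static void *
bench_realloc(void)
{
    void       *buf = malloc(1);
    int         j;

    for (j = 2; j < NSEG; ++j)
        buf = realloc(buf, (size_t) j * 8);
    while (--j > 0)
        buf = realloc(buf, (size_t) j * 8);
    return buf;
}
```

Timing each pass separately (e.g. with clock_gettime()) reproduces the comparison; exact ratios will vary with the filesystem and allocator, as the differing measurements in this thread show.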


Re: mdclose() does not cope w/ FileClose() failure

From: Kyotaro Horiguchi
At Wed, 1 Jan 2020 23:46:02 -0800, Noah Misch <noah@leadboat.com> wrote in 
> On Wed, Dec 25, 2019 at 10:39:32AM +0900, Kyotaro Horiguchi wrote:
> > I'm not sure which is better. If we accept that repalloc
> > (AllocSetRealloc) doesn't actually free memory on shrink, there's no
> > point in calling repalloc when shrinking, and we could omit it in the
> > name of optimization.  If we instead want to free memory as aggressively
> > as possible, we should call repalloc and trust it to do so.
> 
> As long as we free the memory by the end of mdclose(), I think it doesn't
> matter whether we freed memory in the middle of mdclose().

Agreed.

> I ran a crude benchmark that found PathNameOpenFile()+FileClose() costing at
> least two hundred times as much as the repalloc() pair.  Hence, I now plan not
> to avoid repalloc(), as attached.  Crude benchmark code:

I got about a 25x difference with -O0 and about 50x with -O2
(xfs / CentOS 8). That's smaller than I intuitively expected, but
perhaps a 50x difference is large enough.

The patch looks good to me.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center