Thread: mdclose() does not cope w/ FileClose() failure
Forking thread "WAL logging problem in 9.4.3?" for this tangent:

On Mon, Dec 09, 2019 at 06:04:06PM +0900, Kyotaro Horiguchi wrote:
> I don't understand why mdclose checks for (v->mdfd_vfd >= 0) of open
> segment but anyway mdimmedsync is believing that that won't happen and
> I follow the assumption.  (I suspect that the if condition in mdclose
> should be an assertion..)

That check helps when data_sync_retry=on and FileClose() raised an error in a
previous mdclose() invocation.  However, the check is not sufficient to make
that case work; the attached test case (not for commit) gets an assertion
failure or SIGSEGV.  I am inclined to fix this by decrementing
md_num_open_segs before modifying md_seg_fds (second attachment).  An
alternative would be to call _fdvec_resize() after every FileClose(), like
mdtruncate() does; however, the repalloc() overhead could be noticeable.
(mdclose() is called much more frequently than mdtruncate().)

Incidentally, _mdfd_openseg() has this:

	if (segno <= reln->md_num_open_segs[forknum])
		_fdvec_resize(reln, forknum, segno + 1);

That should be >=, not <=.  If the less-than case happened, this would delete
the record of a vfd for a higher-numbered segno.  There's no live bug, because
only segno == reln->md_num_open_segs[forknum] actually happens.  I am inclined
to make an assertion of that and remove the condition:

	Assert(segno == reln->md_num_open_segs[forknum]);
	_fdvec_resize(reln, forknum, segno + 1);
Attachment
On Sun, Dec 22, 2019 at 01:19:30AM -0800, Noah Misch wrote:
> I am inclined to fix this by decrementing md_num_open_segs before modifying
> md_seg_fds (second attachment).

That leaked memory, since _fdvec_resize() assumes md_num_open_segs is also the
allocated array length.  The alternative is looking better:

> An alternative would be to call
> _fdvec_resize() after every FileClose(), like mdtruncate() does; however, the
> repalloc() overhead could be noticeable.  (mdclose() is called much more
> frequently than mdtruncate().)

I can skip repalloc() when the array length decreases, to assuage mdclose()'s
worry.  In the mdclose() case, the final _fdvec_resize(reln, fork, 0) will
still pfree() the array.  Array elements that mdtruncate() frees today will
instead persist to end of transaction.  That is okay, since mdtruncate()
crossing more than one segment boundary is fairly infrequent.  For it to
happen, you must either create a >2G relation and then TRUNCATE it in the same
transaction, or VACUUM must find >1-2G of unused space at the end of the
relation.  I'm now inclined to do it that way, attached.
Attachment
On Sun, Dec 22, 2019 at 10:19 PM Noah Misch <noah@leadboat.com> wrote:
> Assert(segno == reln->md_num_open_segs[forknum]);
> _fdvec_resize(reln, forknum, segno + 1);

Oh yeah, I spotted that part too but didn't follow up.

https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BNBw%2BuSzxF1os-SO6gUuw%3DcqO5DAybk6KnHKzgGvxhxA%40mail.gmail.com
On Mon, Dec 23, 2019 at 09:33:29AM +1300, Thomas Munro wrote:
> On Sun, Dec 22, 2019 at 10:19 PM Noah Misch <noah@leadboat.com> wrote:
> > Assert(segno == reln->md_num_open_segs[forknum]);
> > _fdvec_resize(reln, forknum, segno + 1);
>
> Oh yeah, I spotted that part too but didn't follow up.
>
> https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BNBw%2BuSzxF1os-SO6gUuw%3DcqO5DAybk6KnHKzgGvxhxA%40mail.gmail.com

That patch of yours looks good.
Hello.

At Sun, 22 Dec 2019 12:21:00 -0800, Noah Misch <noah@leadboat.com> wrote in
> On Sun, Dec 22, 2019 at 01:19:30AM -0800, Noah Misch wrote:
> > I am inclined to fix this by decrementing md_num_open_segs before modifying
> > md_seg_fds (second attachment).
>
> That leaked memory, since _fdvec_resize() assumes md_num_open_segs is also the
> allocated array length.  The alternative is looking better:

I agree that v2 is cleaner in the light of modularity, and it fixes the
memory leak that happens at re-open.

> > An alternative would be to call
> > _fdvec_resize() after every FileClose(), like mdtruncate() does; however, the
> > repalloc() overhead could be noticeable.  (mdclose() is called much more
> > frequently than mdtruncate().)
>
> I can skip repalloc() when the array length decreases, to assuage mdclose()'s
> worry.  In the mdclose() case, the final _fdvec_resize(reln, fork, 0) will
> still pfree() the array.  Array elements that mdtruncate() frees today will
> instead persist to end of transaction.  That is okay, since mdtruncate()
> crossing more than one segment boundary is fairly infrequent.  For it to
> happen, you must either create a >2G relation and then TRUNCATE it in the same
> transaction, or VACUUM must find >1-2G of unused space at the end of the
> relation.  I'm now inclined to do it that way, attached.

 	 * It doesn't seem worthwhile complicating the code by having a more
 	 * aggressive growth strategy here; the number of segments doesn't
 	 * grow that fast, and the memory context internally will sometimes
-	 * avoid doing an actual reallocation.
+	 * avoid doing an actual reallocation.  Likewise, since the number of
+	 * segments doesn't shrink that fast, don't shrink at all.  During
+	 * mdclose(), we'll pfree the array at nseg==0.

If I understand it correctly, this comment refers to the total number of
segment files in a fork, not to the length of the md_seg_fds array at a given
moment.  But _fdvec_resize() is actually called for every segment opened
during mdnblocks() (just after mdopen()), and for every segment closed during
mdclose() and mdtruncate(), as mentioned above.  We are going to omit
repallocs only in the shrinking case.

If we regard repalloc as far faster than FileOpen/FileClose, or if we care
only about growth in the number of mdopen'ed segments and not about the
frequent resizes that happen during the functions above, then the comment is
right and we may resize the array in this segment-by-segment manner.  But if
the costs are comparable, or if we don't want the array resized frequently,
we might need to prevent repalloc from happening on every segment increase,
too.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Dec 23, 2019 at 07:41:49PM +0900, Kyotaro Horiguchi wrote:
> At Sun, 22 Dec 2019 12:21:00 -0800, Noah Misch <noah@leadboat.com> wrote in
> > On Sun, Dec 22, 2019 at 01:19:30AM -0800, Noah Misch wrote:
> > > An alternative would be to call
> > > _fdvec_resize() after every FileClose(), like mdtruncate() does; however, the
> > > repalloc() overhead could be noticeable.  (mdclose() is called much more
> > > frequently than mdtruncate().)
> >
> > I can skip repalloc() when the array length decreases, to assuage mdclose()'s
> > worry.  In the mdclose() case, the final _fdvec_resize(reln, fork, 0) will
> > still pfree() the array.  Array elements that mdtruncate() frees today will
> > instead persist to end of transaction.  That is okay, since mdtruncate()
> > crossing more than one segment boundary is fairly infrequent.  For it to
> > happen, you must either create a >2G relation and then TRUNCATE it in the same
> > transaction, or VACUUM must find >1-2G of unused space at the end of the
> > relation.  I'm now inclined to do it that way, attached.
>
>  	 * It doesn't seem worthwhile complicating the code by having a more
>  	 * aggressive growth strategy here; the number of segments doesn't
>  	 * grow that fast, and the memory context internally will sometimes
> -	 * avoid doing an actual reallocation.
> +	 * avoid doing an actual reallocation.  Likewise, since the number of
> +	 * segments doesn't shrink that fast, don't shrink at all.  During
> +	 * mdclose(), we'll pfree the array at nseg==0.
>
> If I understand it correctly, this comment refers to the total number of
> segment files in a fork, not to the length of the md_seg_fds array at a given
> moment.  But _fdvec_resize() is actually called for every segment opened
> during mdnblocks() (just after mdopen()), and for every segment closed during
> mdclose() and mdtruncate(), as mentioned above.  We are going to omit
> repallocs only in the shrinking case.

That is a good point.  How frequently one adds 1 GiB of data is not the main
issue.  mdclose() and subsequent re-opening of all segments will be more
relevant to overall performance.

> If we regard repalloc as far faster than FileOpen/FileClose, or if we care
> only about growth in the number of mdopen'ed segments and not about the
> frequent resizes that happen during the functions above, then the comment is
> right and we may resize the array in this segment-by-segment manner.

In most cases, the array will fit into a power-of-two chunk, so repalloc()
already does the right thing.  Once the table has more than ~1000 segments (~1
TiB table size), the allocation will get a single-chunk block, and every
subsequent repalloc() will call realloc().  Even then, repalloc() probably is
far faster than File operations.  Likely, I should just accept the extra
repalloc() calls and drop the "else if" change in _fdvec_resize().
At Tue, 24 Dec 2019 11:57:39 -0800, Noah Misch <noah@leadboat.com> wrote in
> On Mon, Dec 23, 2019 at 07:41:49PM +0900, Kyotaro Horiguchi wrote:
> > If I understand it correctly, this comment refers to the total number of
> > segment files in a fork, not to the length of the md_seg_fds array at a
> > given moment.  But _fdvec_resize() is actually called for every segment
> > opened during mdnblocks() (just after mdopen()), and for every segment
> > closed during mdclose() and mdtruncate(), as mentioned above.  We are
> > going to omit repallocs only in the shrinking case.
>
> That is a good point.  How frequently one adds 1 GiB of data is not the main
> issue.  mdclose() and subsequent re-opening of all segments will be more
> relevant to overall performance.

Yes, that's exactly what I meant.

> > If we regard repalloc as far faster than FileOpen/FileClose, or if we care
> > only about growth in the number of mdopen'ed segments and not about the
> > frequent resizes that happen during the functions above, then the comment
> > is right and we may resize the array in this segment-by-segment manner.
>
> In most cases, the array will fit into a power-of-two chunk, so repalloc()
> already does the right thing.  Once the table has more than ~1000 segments
> (~1 TiB table size), the allocation will get a single-chunk block, and every
> subsequent repalloc() will call realloc().  Even then, repalloc() probably
> is far faster than File operations.  Likely, I should just accept the extra
> repalloc() calls and drop the "else if" change in _fdvec_resize().

I'm not sure which is better.  If we know that repalloc (AllocSetRealloc)
doesn't free memory at all, there's no point in calling repalloc when
shrinking, and we could omit it in the name of optimization.  If we want to
free memory as much as possible, we should call repalloc, pretending to
believe that that happens.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Dec 25, 2019 at 10:39:32AM +0900, Kyotaro Horiguchi wrote:
> At Tue, 24 Dec 2019 11:57:39 -0800, Noah Misch <noah@leadboat.com> wrote in
> > On Mon, Dec 23, 2019 at 07:41:49PM +0900, Kyotaro Horiguchi wrote:
> > > If we regard repalloc as far faster than FileOpen/FileClose, or if we
> > > care only about growth in the number of mdopen'ed segments and not about
> > > the frequent resizes that happen during the functions above, then the
> > > comment is right and we may resize the array in this segment-by-segment
> > > manner.
> >
> > In most cases, the array will fit into a power-of-two chunk, so repalloc()
> > already does the right thing.  Once the table has more than ~1000 segments
> > (~1 TiB table size), the allocation will get a single-chunk block, and
> > every subsequent repalloc() will call realloc().  Even then, repalloc()
> > probably is far faster than File operations.  Likely, I should just accept
> > the extra repalloc() calls and drop the "else if" change in _fdvec_resize().
>
> I'm not sure which is better.  If we know that repalloc (AllocSetRealloc)
> doesn't free memory at all, there's no point in calling repalloc when
> shrinking, and we could omit it in the name of optimization.  If we want to
> free memory as much as possible, we should call repalloc, pretending to
> believe that that happens.

As long as we free the memory by the end of mdclose(), I think it doesn't
matter whether we freed memory in the middle of mdclose().  I ran a crude
benchmark that found PathNameOpenFile()+FileClose() costing at least two
hundred times as much as the repalloc() pair.  Hence, I now plan not to avoid
repalloc(), as attached.
Crude benchmark code:

#define NSEG 9000

	for (i = 0; i < count1; i++)
	{
		int			j;

		for (j = 0; j < NSEG; ++j)
		{
			File		f = PathNameOpenFile("/etc/services", O_RDONLY);

			if (f < 0)
				elog(ERROR, "fail open: %m");
			FileClose(f);
		}
	}
	for (i = 0; i < count2; i++)
	{
		int			j;
		void	   *buf = palloc(1);

		for (j = 2; j < NSEG; ++j)
			buf = repalloc(buf, j * 8);
		while (--j > 0)
			buf = repalloc(buf, j * 8);
	}
Attachment
At Wed, 1 Jan 2020 23:46:02 -0800, Noah Misch <noah@leadboat.com> wrote in
> On Wed, Dec 25, 2019 at 10:39:32AM +0900, Kyotaro Horiguchi wrote:
> > I'm not sure which is better.  If we know that repalloc (AllocSetRealloc)
> > doesn't free memory at all, there's no point in calling repalloc when
> > shrinking, and we could omit it in the name of optimization.  If we want
> > to free memory as much as possible, we should call repalloc, pretending
> > to believe that that happens.
>
> As long as we free the memory by the end of mdclose(), I think it doesn't
> matter whether we freed memory in the middle of mdclose().

Agreed.

> I ran a crude benchmark that found PathNameOpenFile()+FileClose() costing at
> least two hundred times as much as the repalloc() pair.  Hence, I now plan not
> to avoid repalloc(), as attached.
>
> Crude benchmark code:

I got about a 25x difference with -O0 and about 50x with -O2 (xfs / CentOS 8).
That is smaller than I intuitively expected, but perhaps a 50x difference is
large enough.  The patch looks good to me.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center