Thread: _mdfd_getseg can be expensive
Hi,

I recently have seen some perf profiles in which _mdfd_getseg() was in
the top #3 when VACUUMing large (~200GB) relations, called by mdread()
and mdwrite(). Looking at its implementation, I am not surprised: it
iterates over all segment entries a relation has, for every read or
write. That's not painful for smaller relations, but at a couple of
hundred GB it starts to be annoying, especially if kernel readahead has
already read in all data from disk.

I don't have a good idea what to do about this yet, but it seems like
something that should be fixed mid-term.

The best I can come up with is caching the last MdfdVec used, but that's
fairly ugly. Alternatively it might be a good idea to not store the
MdfdVec entries as a linked list, but as a densely allocated array.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Hi,

On 2014-03-31 12:10:01 +0200, Andres Freund wrote:
> I recently have seen some perf profiles in which _mdfd_getseg() was in
> the top #3 when VACUUMing large (~200GB) relations, called by mdread()
> and mdwrite(). Looking at its implementation, I am not surprised: it
> iterates over all segment entries a relation has, for every read or
> write. That's not painful for smaller relations, but at a couple of
> hundred GB it starts to be annoying, especially if kernel readahead has
> already read in all data from disk.
>
> I don't have a good idea what to do about this yet, but it seems like
> something that should be fixed mid-term.
>
> The best I can come up with is caching the last MdfdVec used, but that's
> fairly ugly. Alternatively it might be a good idea to not store the
> MdfdVec entries as a linked list, but as a densely allocated array.

I've seen this a couple more times since. On larger relations it gets
even more pronounced: when sequentially scanning a 2TB relation,
_mdfd_getseg() gets up to 80% of the proportionate CPU time towards the
end of the scan.

I wrote the attached patch that gets rid of that essentially quadratic
behaviour, by replacing the mdfd chain/singly linked list with an
array. Since we seldom grow files by a whole segment, I can't see the
slightly bigger memory reallocations mattering significantly. In pretty
much every other case the array is bound to be a winner.

Does anybody have fundamental arguments against that idea?

With some additional work we could save a bit more memory by getting rid
of mdfd_segno, as it's essentially redundant - but that's not entirely
trivial and I'm unsure if it's worth it.

I've also attached a second patch that makes PageIsVerified() noticeably
faster when the page is new. That's helpful and related because it makes
it easier to test the correctness of the md.c rewrite by faking full 1GB
segments. It's also pretty simple, so there's imo little reason not to
do it.

Greetings,

Andres Freund

PS: The research leading to these results has received funding from the
European Union's Seventh Framework Programme (FP7/2007-2013) under grant
agreement n° 318633

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
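[For illustration only: the following is not the patch being discussed, and
every name and number in it is made up. It is a small standalone sketch of
the difference described above: finding segment N by walking a per-segment
chain costs O(N) on every read or write, while indexing a dense array is
O(1).]

/* sketch_segments.c - illustrative only, not PostgreSQL code */
#include <stdio.h>
#include <stdlib.h>

typedef struct SegNode          /* old scheme: one node per 1GB segment */
{
    int             fd;         /* stand-in for the virtual fd */
    struct SegNode *next;
} SegNode;

/* chain lookup: walked from the head for every smgr read/write */
static SegNode *
chain_lookup(SegNode *head, int segno)
{
    SegNode *v = head;
    int      i;

    for (i = 0; i < segno; i++)
        v = v->next;
    return v;
}

/* array lookup: direct index, after checking how many entries are open */
static SegNode *
array_lookup(SegNode *segs, int nopen, int segno)
{
    if (segno >= nopen)
        return NULL;            /* md.c would open/extend the segment here */
    return &segs[segno];
}

int
main(void)
{
    int      nsegs = 2048;      /* roughly a 2TB relation at 1GB segments */
    SegNode *segs = calloc(nsegs, sizeof(SegNode));
    int      i;

    for (i = 0; i < nsegs - 1; i++)     /* build the chain over the same nodes */
        segs[i].next = &segs[i + 1];

    printf("chain: %p\n", (void *) chain_lookup(segs, nsegs - 1));
    printf("array: %p\n", (void *) array_lookup(segs, nsegs, nsegs - 1));
    free(segs);
    return 0;
}

[The chain lookup touches every node in front of the target, which is what
shows up in the profiles above once a relation has hundreds or thousands of
segments.]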
Andres Freund <andres@2ndquadrant.com> writes:
> I wrote the attached patch that gets rid of that essentially quadratic
> behaviour, by replacing the mdfd chain/singly linked list with an
> array. Since we seldom grow files by a whole segment, I can't see the
> slightly bigger memory reallocations mattering significantly. In pretty
> much every other case the array is bound to be a winner.

> Does anybody have fundamental arguments against that idea?

While the basic idea is sound, this particular implementation seems
pretty bizarre. What's with the "md_seg_no" stuff, and why is that
array typed size_t? IOW, why didn't you *just* replace the linked list
with an array? This patch seems to be making some other changes that
you've failed to explain.

			regards, tom lane
On 2014-10-31 18:48:45 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > I wrote the attached patch that gets rid of that essentially quadratic
> > behaviour, by replacing the mdfd chain/singly linked list with an
> > array. Since we seldom grow files by a whole segment, I can't see the
> > slightly bigger memory reallocations mattering significantly. In
> > pretty much every other case the array is bound to be a winner.
>
> > Does anybody have fundamental arguments against that idea?
>
> While the basic idea is sound, this particular implementation seems
> pretty bizarre. What's with the "md_seg_no" stuff, and why is that
> array typed size_t?
> IOW, why didn't you *just* replace the linked list with an array?

It stores the length of the array of _MdfdVec entries. To know whether
it's safe to access some element we first need to check whether we've
loaded that many entries. It's size_t[] because that seemed to be the
most appropriate type for the length of an array.

It's an array because md.c had already chosen to represent relation
forks via an array indexed by the fork. So

    size_t md_seg_no[MAX_FORKNUM + 1];

contains the length of the _MdfdVec array for each fork. These arrays
are stored in:

    struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];

> This patch seems to be making some other changes that you've failed to
> explain.

I'm not aware of any that aren't just a consequence of not iterating
through the linked list anymore. What change are you thinking of?

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
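[A compilable sketch of the layout described above. The surrounding struct
is heavily simplified and the typedefs are stand-ins for the real
PostgreSQL ones; only md_seg_no/md_seg_fds reflect what the message
describes.]

#include <stddef.h>

#define MAX_FORKNUM 3           /* main, fsm, vm, init forks */

typedef int File;               /* index into fd.c's virtual fd pool */
typedef unsigned int BlockNumber;
typedef int ForkNumber;

typedef struct _MdfdVec
{
    File        mdfd_vfd;       /* open fd for this 1GB segment */
    BlockNumber mdfd_segno;     /* segment number */
} MdfdVec;

typedef struct SMgrRelationData
{
    /* ... other smgr fields elided ... */
    size_t      md_seg_no[MAX_FORKNUM + 1];     /* entries loaded per fork */
    MdfdVec    *md_seg_fds[MAX_FORKNUM + 1];    /* dense array per fork */
} SMgrRelationData;

/* an element may only be touched after checking the loaded length */
static MdfdVec *
seg_if_open(SMgrRelationData *reln, ForkNumber forknum, size_t segno)
{
    if (segno >= reln->md_seg_no[forknum])
        return NULL;            /* not opened yet; md.c would open it first */
    return &reln->md_seg_fds[forknum][segno];
}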
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-10-31 18:48:45 -0400, Tom Lane wrote:
>> While the basic idea is sound, this particular implementation seems
>> pretty bizarre. What's with the "md_seg_no" stuff, and why is that
>> array typed size_t?

> It stores the length of the array of _MdfdVec entries.

Oh. "seg_no" seems like not a very good choice of name then. Perhaps
"md_seg_count" or something like that would be more intelligible.

And personally I'd have made it an int, because we are certainly not
doing segment-number arithmetic in anything wider than int anywhere
else. Introducing size_t into the mix won't do anything except create a
risk of signed-vs-unsigned logic bugs.

			regards, tom lane
On 2014-11-01 12:57:40 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-10-31 18:48:45 -0400, Tom Lane wrote:
> >> While the basic idea is sound, this particular implementation seems
> >> pretty bizarre. What's with the "md_seg_no" stuff, and why is that
> >> array typed size_t?
> >
> > It stores the length of the array of _MdfdVec entries.
>
> Oh. "seg_no" seems like not a very good choice of name then.
> Perhaps "md_seg_count" or something like that would be more intelligible.

That's fine with me.

> And personally I'd have made it an int, because we are certainly not doing
> segment-number arithmetic in anything wider than int anywhere else.

Fine with me too. I picked size_t out of habit, because there are
projects that don't allow anything else to be used for lengths of
memory...

During testing I've also noticed that it accidentally introduced a
vfd/memory leak... So I'll repost a version with those fixes.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 2014-11-01 18:23:47 +0100, Andres Freund wrote:
> On 2014-11-01 12:57:40 -0400, Tom Lane wrote:
> > Andres Freund <andres@2ndquadrant.com> writes:
> > > On 2014-10-31 18:48:45 -0400, Tom Lane wrote:
> > >> While the basic idea is sound, this particular implementation seems
> > >> pretty bizarre. What's with the "md_seg_no" stuff, and why is that
> > >> array typed size_t?
> > >
> > > It stores the length of the array of _MdfdVec entries.
> >
> > Oh. "seg_no" seems like not a very good choice of name then.
> > Perhaps "md_seg_count" or something like that would be more intelligible.
>
> That's fine with me.

Went with md_num_open_segs - reading the code, that was easier to
understand.

> So I'll repost a version with those fixes.

Took a while. But here we go. The attached version is a significantly
revised version of my earlier patch. Notably I've pretty much entirely
revised the code in _mdfd_getseg() to be more similar to the state in
master. Also some comment policing.

As it's more than a bit painful to test with large numbers of 1GB
segments, I've rebuilt my local postgres with 100kB segments. Running a
pgbench -s 300 with 128MB shared buffers clearly shows the efficiency
difference: 320 tps vs 4900 in an assertion-enabled build. Obviously
that's kind of an artificial situation... But I've also verified this
with a) fake relations built out of sparse files, b) actually large
relations. The latter shows performance benefits as well, but my
patience limits the amount of testing I was willing to do...

Kevin, Robert: I've CCed you because Robert commented in the recent
mdnblocks() blocks thread that Kevin lamented the O(n)-ness of md.c...

Andres
On Tue, Dec 15, 2015 at 10:04 AM, Andres Freund <andres@anarazel.de> wrote:
> Took a while. But here we go. The attached version is a significantly
> revised version of my earlier patch. Notably I've pretty much entirely
> revised the code in _mdfd_getseg() to be more similar to the state in
> master. Also some comment policing.

Are you planning to pick this up again, Andres?

--
Peter Geoghegan
On 2016-06-30 18:14:15 -0700, Peter Geoghegan wrote:
> On Tue, Dec 15, 2015 at 10:04 AM, Andres Freund <andres@anarazel.de> wrote:
> > Took a while. But here we go. The attached version is a significantly
> > revised version of my earlier patch. Notably I've pretty much entirely
> > revised the code in _mdfd_getseg() to be more similar to the state in
> > master. Also some comment policing.
>
> Are you planning to pick this up again, Andres?

I plan to, once the tree opens again. Likely needs some considerable
updates for recent changes.

Andres
On Thu, Jun 30, 2016 at 6:23 PM, Andres Freund <andres@anarazel.de> wrote:
> I plan to, once the tree opens again. Likely needs some considerable
> updates for recent changes.

Offhand, do you think that CREATE INDEX calls to smgrextend() could be
appreciably affected by this bottleneck? If that's a very involved or
difficult question, then no need to answer now.

--
Peter Geoghegan
On 2016-06-30 18:34:20 -0700, Peter Geoghegan wrote:
> On Thu, Jun 30, 2016 at 6:23 PM, Andres Freund <andres@anarazel.de> wrote:
> > I plan to, once the tree opens again. Likely needs some considerable
> > updates for recent changes.
>
> Offhand, do you think that CREATE INDEX calls to smgrextend() could be
> appreciably affected by this bottleneck? If that's a very involved or
> difficult question, then no need to answer now.

If you have a big enough index (maybe ~150GB+), sure. Before that,
probably not.

It's usually pretty easy to see in CPU profiles whether this issue
exists.

Andres
On Thu, Jun 30, 2016 at 7:08 PM, Andres Freund <andres@anarazel.de> wrote:
> If you have a big enough index (maybe ~150GB+), sure. Before that,
> probably not.
>
> It's usually pretty easy to see in CPU profiles whether this issue
> exists.

I think that this is a contributing factor to why merging in parallel
CREATE INDEX becomes much more CPU bound when building such very large
indexes, which Corey Huinker has benchmarked using an advanced copy of
the patch. He has shown cases that are sped up by 3.6x when 8 parallel
workers are used (compared to a serial CREATE INDEX), but a several
hundred gigabyte index case only sees a speedup of about 1.5x. (This
bottleneck affects serial CREATE INDEX merging just as much as parallel,
since that part isn't parallelized, but it's far more noticeable with
parallel CREATE INDEX simply because merging in the leader becomes a
huge bottleneck.)

Those two cases were not exactly comparable in perhaps several other
ways, but even so, my sense is that this can be at least partially
explained by md.c bottlenecks. This is something that we'll need to
confirm through profiling. Hopefully it's just this one bottleneck.

--
Peter Geoghegan
On 2016-06-30 18:14:15 -0700, Peter Geoghegan wrote:
> On Tue, Dec 15, 2015 at 10:04 AM, Andres Freund <andres@anarazel.de> wrote:
> > Took a while. But here we go. The attached version is a significantly
> > revised version of my earlier patch. Notably I've pretty much entirely
> > revised the code in _mdfd_getseg() to be more similar to the state in
> > master. Also some comment policing.
>
> Are you planning to pick this up again, Andres?

Rebased version attached. A review would be welcome. Plan to push this
forward otherwise in the not too far away future.

Andres
On Thu, Aug 18, 2016 at 5:26 PM, Andres Freund <andres@anarazel.de> wrote:
> Rebased version attached. A review would be welcome. Plan to push this
> forward otherwise in the not too far away future.

I can review this next week.

--
Peter Geoghegan
On 2016-08-18 17:27:59 -0700, Peter Geoghegan wrote:
> On Thu, Aug 18, 2016 at 5:26 PM, Andres Freund <andres@anarazel.de> wrote:
> > Rebased version attached. A review would be welcome. Plan to push this
> > forward otherwise in the not too far away future.
>
> I can review this next week.

Thanks
On Thu, Aug 18, 2016 at 5:28 PM, Andres Freund <andres@anarazel.de> wrote:
>> I can review this next week.
>
> Thanks

Given the time frame that you have in mind, I won't revisit the parallel
CLUSTER CPU bottleneck issue until this is committed. The patch might
change things enough that that would be a waste of time.

--
Peter Geoghegan
On 2016-08-18 17:35:47 -0700, Peter Geoghegan wrote:
> On Thu, Aug 18, 2016 at 5:28 PM, Andres Freund <andres@anarazel.de> wrote:
> >> I can review this next week.
> >
> > Thanks
>
> Given the time frame that you have in mind, I won't revisit the parallel
> CLUSTER CPU bottleneck issue until this is committed. The patch might
> change things enough that that would be a waste of time.

How large were the index & table in question? I mean this really only
comes into effect at 100+ segments.
On Thu, Aug 18, 2016 at 5:42 PM, Andres Freund <andres@anarazel.de> wrote:
> How large were the index & table in question? I mean this really only
> comes into effect at 100+ segments.

Not that big, but I see no reason to take the chance, I suppose.

--
Peter Geoghegan
On Thu, Aug 18, 2016 at 5:26 PM, Andres Freund <andres@anarazel.de> wrote:
> Rebased version attached. A review would be welcome. Plan to push this
> forward otherwise in the not too far away future.

This looks good.

The only thing that stuck out to any degree is that we don't grow the
"reln->md_seg_fds[forknum]" array within the new _fdvec_resize()
function geometrically. In other words, we don't separately keep track
of the array size and allocated size, and grow the allocated size using
the doubling strategy that you see in many places. Now, I freely admit
that that probably doesn't matter, given what that array tracks. I'm
just pointing out that that aspect did give me pause. The struct
MdfdVec is small enough that that *might* be worthwhile.

--
Peter Geoghegan
On Wed, Aug 31, 2016 at 2:09 PM, Peter Geoghegan <pg@heroku.com> wrote:
> The only thing that stuck out to any degree is that we don't grow the
> "reln->md_seg_fds[forknum]" array within the new _fdvec_resize()
> function geometrically.

That new function looks like this:

> +static void
> +_fdvec_resize(SMgrRelation reln,
> +              ForkNumber forknum,
> +              int nseg)
>  {
> -    return (MdfdVec *) MemoryContextAlloc(MdCxt, sizeof(MdfdVec));
> +    if (nseg == 0)
> +    {
> +        if (reln->md_num_open_segs[forknum] > 0)
> +        {
> +            pfree(reln->md_seg_fds[forknum]);
> +            reln->md_seg_fds[forknum] = NULL;
> +        }
> +    }
> +    else if (reln->md_num_open_segs[forknum] == 0)
> +    {
> +        reln->md_seg_fds[forknum] =
> +            MemoryContextAlloc(MdCxt, sizeof(MdfdVec) * nseg);
> +    }
> +    else
> +    {
> +        reln->md_seg_fds[forknum] =
> +            repalloc(reln->md_seg_fds[forknum],
> +                     sizeof(MdfdVec) * nseg);
> +    }
> +
> +    reln->md_num_open_segs[forknum] = nseg;
>  }

Another tiny tweak that you might consider occurs to me here: It
couldn't hurt to "Assert(reln->md_seg_fds[forknum] == NULL)" within the
"else if (reln->md_num_open_segs[forknum] == 0)" path here, prior to the
MemoryContextAlloc(). If it's worth setting it to NULL when there are 0
segs (i.e., "reln->md_seg_fds[forknum] = NULL"), then perhaps it's worth
checking that things are found that way later.

I guess that this is at odds with remarks in my last mail about
considering geometric allocations for the "reln->md_seg_fds[forknum]"
array. Both feedback items are more or less just things that bothered
me ever so slightly, which I don't necessarily expect you to act on, but
assume you'd like to hear.

--
Peter Geoghegan
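[For concreteness, this is the shape of that suggestion, shown against the
branch quoted above; a sketch of the proposed tweak only, not a committed
change.]

    else if (reln->md_num_open_segs[forknum] == 0)
    {
        /*
         * Suggested addition: the array pointer was reset to NULL when the
         * last segment was closed, so it must still be NULL here.
         */
        Assert(reln->md_seg_fds[forknum] == NULL);
        reln->md_seg_fds[forknum] =
            MemoryContextAlloc(MdCxt, sizeof(MdfdVec) * nseg);
    }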
On 2016-08-31 14:09:47 -0700, Peter Geoghegan wrote:
> On Thu, Aug 18, 2016 at 5:26 PM, Andres Freund <andres@anarazel.de> wrote:
> > Rebased version attached. A review would be welcome. Plan to push this
> > forward otherwise in the not too far away future.
>
> This looks good.

Thanks for looking!

> The only thing that stuck out to any degree is that we don't grow the
> "reln->md_seg_fds[forknum]" array within the new _fdvec_resize()
> function geometrically. In other words, we don't separately keep track
> of the array size and allocated size, and grow the allocated size using
> the doubling strategy that you see in many places.

I can't really see that being worth the code here. The cost of
open()/lseek()'ing etc. is going to dwarf the cost of this.

Regards,

Andres
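[For reference, the doubling strategy being discussed would look roughly
like the following. This is a generic, self-contained sketch (using plain
realloc rather than repalloc in MdCxt, error handling omitted), not code
from the patch, and as the reply above says it was not adopted: the
open()/lseek() cost of actually opening a new segment dwarfs the occasional
exact-size repalloc().]

#include <stdlib.h>

typedef struct SegArray
{
    void   *segs;           /* would be the MdfdVec array in md.c */
    size_t  elem_size;
    int     nopen;          /* number of open segments (what callers see) */
    int     capacity;       /* allocated slots, grown geometrically */
} SegArray;

static void
seg_array_resize(SegArray *a, int nseg)
{
    if (nseg > a->capacity)
    {
        /* double until it fits, so n extensions cause O(log n) reallocations */
        int newcap = (a->capacity > 0) ? a->capacity : 4;

        while (newcap < nseg)
            newcap *= 2;
        a->segs = realloc(a->segs, a->elem_size * newcap);
        a->capacity = newcap;
    }
    a->nopen = nseg;
}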
On Wed, Aug 31, 2016 at 2:37 PM, Andres Freund <andres@anarazel.de> wrote:
>> This looks good.
>
> Thanks for looking!

No problem. In other painfully pedantic news, I should point out that
sizeof(size_t) isn't necessarily word size (the most generic definition
of word size for the architecture), contrary to my reading of the 0002-*
patch comments. I'm mostly thinking about x86_64 here, of course.

--
Peter Geoghegan
On August 31, 2016 3:06:23 PM PDT, Peter Geoghegan <pg@heroku.com> wrote:
> In other painfully pedantic news, I should point out that
> sizeof(size_t) isn't necessarily word size (the most generic definition
> of word size for the architecture), contrary to my reading of the 0002-*
> patch comments. I'm mostly thinking about x86_64 here, of course.

Uh?

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Wed, Aug 31, 2016 at 3:08 PM, Andres Freund <andres@anarazel.de> wrote:
> On August 31, 2016 3:06:23 PM PDT, Peter Geoghegan <pg@heroku.com> wrote:
>
>> In other painfully pedantic news, I should point out that
>> sizeof(size_t) isn't necessarily word size (the most generic definition
>> of word size for the architecture), contrary to my reading of the 0002-*
>> patch comments. I'm mostly thinking about x86_64 here, of course.
>
> Uh?

Sorry, I really should not have said anything. It is true that x86_64
word size is sometimes reported as 16 and/or 32 bits [1], because of
legacy issues.

[1] https://en.wikipedia.org/wiki/Word_(computer_architecture)#Table_of_word_sizes

--
Peter Geoghegan
On 2016-08-31 15:15:16 -0700, Peter Geoghegan wrote:
> On Wed, Aug 31, 2016 at 3:08 PM, Andres Freund <andres@anarazel.de> wrote:
> > On August 31, 2016 3:06:23 PM PDT, Peter Geoghegan <pg@heroku.com> wrote:
> >
> >> In other painfully pedantic news, I should point out that
> >> sizeof(size_t) isn't necessarily word size (the most generic definition
> >> of word size for the architecture), contrary to my reading of the 0002-*
> >> patch comments. I'm mostly thinking about x86_64 here, of course.
> >
> > Uh?
>
> Sorry, I really should not have said anything. It is true that x86_64
> word size is sometimes reported as 16 and/or 32 bits [1], because of
> legacy issues.

I think "native word size" describes the issue well enough. And, more
importantly, I can't think of an equally short but more accurate
description.

I've pushed the patches. Thanks for the review.

Andres