
_mdfd_getseg can be expensive

From: Andres Freund
Hi,

I recently have seen some perf profiles in which _mdfd_getseg() was in
the top #3 when VACUUMing large (~200GB) relations, called by mdread()
and mdwrite(). Looking at its implementation, I am not surprised: it
iterates over all the segment entries a relation has, for every read or
write. That's not painful for smaller relations, but at a couple of
hundred GB it starts to be annoying, especially if kernel readahead has
already read in all the data from disk.

I don't have a good idea what to do about this yet, but it seems like
something that should be fixed mid-term.

The best I can come up with is caching the last MdfdVec used, but
that's fairly ugly. Alternatively, it might be a good idea to store the
MdfdVec entries not as a linked list but as a densely allocated array.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: _mdfd_getseg can be expensive

From: Andres Freund
Hi,

On 2014-03-31 12:10:01 +0200, Andres Freund wrote:
> I recently have seen some perf profiles in which _mdfd_getseg() was in
> the top #3 when VACUUMing large (~200GB) relations, called by mdread()
> and mdwrite(). Looking at its implementation, I am not surprised: it
> iterates over all the segment entries a relation has, for every read or
> write. That's not painful for smaller relations, but at a couple of
> hundred GB it starts to be annoying, especially if kernel readahead has
> already read in all the data from disk.
>
> I don't have a good idea what to do about this yet, but it seems like
> something that should be fixed mid-term.
>
> The best I can come up with is caching the last MdfdVec used, but
> that's fairly ugly. Alternatively, it might be a good idea to store the
> MdfdVec entries not as a linked list but as a densely allocated array.

I've seen this a couple more times since. On larger relations it gets
even more pronounced: when sequentially scanning a 2TB relation,
_mdfd_getseg() gets up to 80% of proportionate CPU time towards the end
of the scan.

I wrote the attached patch that gets rid of that essentially quadratic
behaviour by replacing the mdfd chain/singly linked list with an
array. Since we seldom grow files by a whole segment, I can't see the
slightly bigger memory reallocations mattering significantly. In pretty
much every other case the array is bound to be a winner.

Does anybody have fundamental arguments against that idea?

With some additional work we could save a bit more memory by getting rid
of mdfd_segno, as it's essentially redundant - but that's not
entirely trivial and I'm unsure whether it's worth it.

I've also attached a second patch that makes PageIsVerified() noticeably
faster when the page is new. That's helpful and related because it makes
it easier to test the correctness of the md.c rewrite by faking full 1GB
segments. It's also pretty simple, so there's imo little reason not to
do it.

Greetings,

Andres Freund

PS:
  The research leading to these results has received funding from the
  European Union's Seventh Framework Programme (FP7/2007-2013) under
  grant agreement n° 318633

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: _mdfd_getseg can be expensive

From: Tom Lane
Andres Freund <andres@2ndquadrant.com> writes:
> I wrote the attached patch that gets rid of that essentially quadratic
> behaviour by replacing the mdfd chain/singly linked list with an
> array. Since we seldom grow files by a whole segment, I can't see the
> slightly bigger memory reallocations mattering significantly. In pretty
> much every other case the array is bound to be a winner.

> Does anybody have fundamental arguments against that idea?

While the basic idea is sound, this particular implementation seems
pretty bizarre.  What's with the "md_seg_no" stuff, and why is that
array typed size_t?  IOW, why didn't you *just* replace the linked
list with an array?  This patch seems to be making some other changes
that you've failed to explain.
        regards, tom lane



Re: _mdfd_getseg can be expensive

From: Andres Freund
On 2014-10-31 18:48:45 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > I wrote the attached patch that gets rid of that essentially quadratic
> > behaviour by replacing the mdfd chain/singly linked list with an
> > array. Since we seldom grow files by a whole segment, I can't see the
> > slightly bigger memory reallocations mattering significantly. In pretty
> > much every other case the array is bound to be a winner.
> 
> > Does anybody have fundamental arguments against that idea?
> 
> While the basic idea is sound, this particular implementation seems
> pretty bizarre.  What's with the "md_seg_no" stuff, and why is that
> array typed size_t?
> IOW, why didn't you *just* replace the linked list with an array?

It stores the length of the array of _MdfdVec entries. To know whether
it's safe to access some element, we first need to check whether we've
loaded that many entries. It's size_t[] because size_t seemed to be the
most appropriate type for the length of an array, and an array because
md.c had already chosen to represent relation forks via an array indexed
by fork number.

So
    size_t          md_seg_no[MAX_FORKNUM + 1];
contains the length of the _MdfdVec array for each fork. These arrays
are stored in:
    struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];

> This patch seems to be making some other changes that you've failed to
> explain.

I'm not aware of any that aren't just a consequence of not iterating
through the linked list anymore. What change are you thinking of?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: _mdfd_getseg can be expensive

From: Tom Lane
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-10-31 18:48:45 -0400, Tom Lane wrote:
>> While the basic idea is sound, this particular implementation seems
>> pretty bizarre.  What's with the "md_seg_no" stuff, and why is that
>> array typed size_t?

> It stores the length of the array of _MdfdVec entries.

Oh.  "seg_no" seems like not a very good choice of name then.
Perhaps "md_seg_count" or something like that would be more intelligible.

And personally I'd have made it an int, because we are certainly not doing
segment-number arithmetic in anything wider than int anywhere else.
Introducing size_t into the mix won't do anything except create a risk of
signed-vs-unsigned logic bugs.
        regards, tom lane



Re: _mdfd_getseg can be expensive

From: Andres Freund
On 2014-11-01 12:57:40 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-10-31 18:48:45 -0400, Tom Lane wrote:
> >> While the basic idea is sound, this particular implementation seems
> >> pretty bizarre.  What's with the "md_seg_no" stuff, and why is that
> >> array typed size_t?
> 
> > It stores the length of the array of _MdfdVec entries.
> 
> Oh.  "seg_no" seems like not a very good choice of name then.
> Perhaps "md_seg_count" or something like that would be more intelligible.

That's fine with me.

> And personally I'd have made it an int, because we are certainly not doing
> segment-number arithmetic in anything wider than int anywhere else.

Fine with me too. I picked size_t by habit, because there are projects
that don't allow anything else to be used for lengths of memory...

During testing I've also noticed that it had accidentally introduced a
vfd/memory leak...

So I'll repost a version with those fixes.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: _mdfd_getseg can be expensive

From: Andres Freund
On 2014-11-01 18:23:47 +0100, Andres Freund wrote:
> On 2014-11-01 12:57:40 -0400, Tom Lane wrote:
> > Andres Freund <andres@2ndquadrant.com> writes:
> > > On 2014-10-31 18:48:45 -0400, Tom Lane wrote:
> > >> While the basic idea is sound, this particular implementation seems
> > >> pretty bizarre.  What's with the "md_seg_no" stuff, and why is that
> > >> array typed size_t?
> >
> > > It stores the length of the array of _MdfdVec entries.
> >
> > Oh.  "seg_no" seems like not a very good choice of name then.
> > Perhaps "md_seg_count" or something like that would be more intelligible.
>
> That's fine with me.

Went with md_num_open_segs; when reading the code, that was easier to
understand.

> So I'll repost a version with those fixes.

Took a while. But here we go. The attached version is a significantly
revised version of my earlier patch. Notably I've pretty much entirely
revised the code in _mdfd_getseg() to be more similar to the state in
master. Also some comment policing.

As it's more than a bit painful to test with large numbers of 1GB
segments, I've rebuilt my local postgres with 100kB segments. Running
pgbench -s 300 with 128MB shared buffers clearly shows the efficiency
difference: 320 tps vs. 4900 tps in an assertion-enabled build.
Obviously that's kind of an artificial situation...

But I've also verified this with a) fake relations built out of sparse
files, b) actually large relations. The latter shows performance
benefits as well, but my patience limits the amount of testing I was
willing to do...

Kevin, Robert: I've CCed you because Robert commented in the recent
mdnblocks() thread that Kevin lamented the O(n)'ness of md.c...

Andres

Attachment

Re: _mdfd_getseg can be expensive

From: Peter Geoghegan
On Tue, Dec 15, 2015 at 10:04 AM, Andres Freund <andres@anarazel.de> wrote:
> Took a while. But here we go. The attached version is a significantly
> revised version of my earlier patch. Notably I've pretty much entirely
> revised the code in _mdfd_getseg() to be more similar to the state in
> master. Also some comment policing.

Are you planning to pick this up again, Andres?

-- 
Peter Geoghegan



Re: _mdfd_getseg can be expensive

From: Andres Freund
On 2016-06-30 18:14:15 -0700, Peter Geoghegan wrote:
> On Tue, Dec 15, 2015 at 10:04 AM, Andres Freund <andres@anarazel.de> wrote:
> > Took a while. But here we go. The attached version is a significantly
> > revised version of my earlier patch. Notably I've pretty much entirely
> > revised the code in _mdfd_getseg() to be more similar to the state in
> > master. Also some comment policing.
> 
> Are you planning to pick this up again, Andres?

I plan to, once the tree opens again. Likely needs some considerable
updates for recent changes.

Andres



Re: _mdfd_getseg can be expensive

From: Peter Geoghegan
On Thu, Jun 30, 2016 at 6:23 PM, Andres Freund <andres@anarazel.de> wrote:
> I plan to, once the tree opens again. Likely needs some considerable
> updates for recent changes.

Offhand, do you think that CREATE INDEX calls to smgrextend() could be
appreciably affected by this bottleneck? If that's a very involved or
difficult question, then no need to answer now.

-- 
Peter Geoghegan



Re: _mdfd_getseg can be expensive

From: Andres Freund
On 2016-06-30 18:34:20 -0700, Peter Geoghegan wrote:
> On Thu, Jun 30, 2016 at 6:23 PM, Andres Freund <andres@anarazel.de> wrote:
> > I plan to, once the tree opens again. Likely needs some considerable
> > updates for recent changes.
> 
> Offhand, do you think that CREATE INDEX calls to smgrextend() could be
> appreciably affected by this bottleneck? If that's a very involved or
> difficult question, then no need to answer now.

If you have a big enough index (maybe ~150GB+), sure. Before that,
probably not.

It's usually pretty easy to see in cpu profiles whether this issue
exists.

Andres



Re: _mdfd_getseg can be expensive

From: Peter Geoghegan
On Thu, Jun 30, 2016 at 7:08 PM, Andres Freund <andres@anarazel.de> wrote:
> If you have a big enough index (maybe ~150GB+), sure. Before that,
> probably not.
>
> It's usually pretty easy to see in cpu profiles whether this issue
> exists.

I think that this is a contributing factor to why merging in parallel
CREATE INDEX becomes much more CPU bound when building such very large
indexes, which Corey Huinker has benchmarked using an advance copy of
the patch. He has shown cases that are sped up by 3.6x when 8 parallel
workers are used (compared to a serial CREATE INDEX), but a several
hundred gigabyte index case only sees a speedup of about 1.5x. (This
bottleneck affects serial CREATE INDEX merging just as much as
parallel, since that part isn't parallelized, but it's far more
noticeable with parallel CREATE INDEX simply because merging in the
leader becomes a huge bottleneck.)

Those two cases were not exactly comparable in several other ways, but
even so my sense is that this can be at least partially explained by
md.c bottlenecks. This is something that we'll need to confirm through
profiling. Hopefully it's just this one bottleneck.

--
Peter Geoghegan



Re: _mdfd_getseg can be expensive

From: Andres Freund
On 2016-06-30 18:14:15 -0700, Peter Geoghegan wrote:
> On Tue, Dec 15, 2015 at 10:04 AM, Andres Freund <andres@anarazel.de> wrote:
> > Took a while. But here we go. The attached version is a significantly
> > revised version of my earlier patch. Notably I've pretty much entirely
> > revised the code in _mdfd_getseg() to be more similar to the state in
> > master. Also some comment policing.
>
> Are you planning to pick this up again, Andres?

Rebased version attached. A review would be welcome. Plan to push this
forward otherwise in the not too far away future.

Andres

Attachment

Re: _mdfd_getseg can be expensive

From: Peter Geoghegan
On Thu, Aug 18, 2016 at 5:26 PM, Andres Freund <andres@anarazel.de> wrote:
> Rebased version attached. A review would be welcome. Plan to push this
> forward otherwise in the not too far away future.

I can review this next week.


-- 
Peter Geoghegan



Re: _mdfd_getseg can be expensive

From: Andres Freund
On 2016-08-18 17:27:59 -0700, Peter Geoghegan wrote:
> On Thu, Aug 18, 2016 at 5:26 PM, Andres Freund <andres@anarazel.de> wrote:
> > Rebased version attached. A review would be welcome. Plan to push this
> > forward otherwise in the not too far away future.
> 
> I can review this next week.

Thanks



Re: _mdfd_getseg can be expensive

From: Peter Geoghegan
On Thu, Aug 18, 2016 at 5:28 PM, Andres Freund <andres@anarazel.de> wrote:
>> I can review this next week.
>
> Thanks

Given the time frame that you have in mind, I won't revisit the
parallel CLUSTER CPU bottleneck question until this is committed. The
patch might change things enough that that would be a waste of time.

-- 
Peter Geoghegan



Re: _mdfd_getseg can be expensive

From: Andres Freund
On 2016-08-18 17:35:47 -0700, Peter Geoghegan wrote:
> On Thu, Aug 18, 2016 at 5:28 PM, Andres Freund <andres@anarazel.de> wrote:
> >> I can review this next week.
> >
> > Thanks
> 
> Given the time frame that you have in mind, I won't revisit the
> question the parallel CLUSTER CPU bottleneck issue until this is
> committed. The patch might change things enough that that would be a
> waste of time.

How large was the index & table in question? I mean this really only
comes into effect at 100+ segments.



Re: _mdfd_getseg can be expensive

From: Peter Geoghegan
On Thu, Aug 18, 2016 at 5:42 PM, Andres Freund <andres@anarazel.de> wrote:
> How large was the index & table in question? I mean this really only
> comes into effect at 100+ segments.

Not that big, but I see no reason to take the chance, I suppose.

-- 
Peter Geoghegan



Re: _mdfd_getseg can be expensive

From: Peter Geoghegan
On Thu, Aug 18, 2016 at 5:26 PM, Andres Freund <andres@anarazel.de> wrote:
> Rebased version attached. A review would be welcome. Plan to push this
> forward otherwise in the not too far away future.

This looks good.

The only thing that stuck out to any degree is that we don't grow the
"reln->md_seg_fds[forknum]" array within the new _fdvec_resize()
function geometrically. In other words, we don't separately keep track
of the array size and allocated size, and grow the allocated size
using the doubling strategy that you see in many places.

Now, I freely admit that this probably doesn't matter, given what that
array tracks. I'm just pointing out that this aspect did give me pause.
struct MdfdVec is small enough that geometric growth *might* be
worthwhile.

-- 
Peter Geoghegan



Re: _mdfd_getseg can be expensive

From: Peter Geoghegan
On Wed, Aug 31, 2016 at 2:09 PM, Peter Geoghegan <pg@heroku.com> wrote:
> The only thing that stuck out to any degree is that we don't grow the
> "reln->md_seg_fds[forknum]" array within the new _fdvec_resize()
> function geometrically.

That new function looks like this:

> +static void
> +_fdvec_resize(SMgrRelation reln,
> +             ForkNumber forknum,
> +             int nseg)
>  {
> -   return (MdfdVec *) MemoryContextAlloc(MdCxt, sizeof(MdfdVec));
> +   if (nseg == 0)
> +   {
> +       if (reln->md_num_open_segs[forknum] > 0)
> +       {
> +           pfree(reln->md_seg_fds[forknum]);
> +           reln->md_seg_fds[forknum] = NULL;
> +       }
> +   }
> +   else if (reln->md_num_open_segs[forknum] == 0)
> +   {
> +       reln->md_seg_fds[forknum] =
> +           MemoryContextAlloc(MdCxt, sizeof(MdfdVec) * nseg);
> +   }
> +   else
> +   {
> +       reln->md_seg_fds[forknum] =
> +           repalloc(reln->md_seg_fds[forknum],
> +                    sizeof(MdfdVec) * nseg);
> +   }
> +
> +   reln->md_num_open_segs[forknum] = nseg;
>  }

Another tiny tweak that you might consider occurs to me here: It
couldn't hurt to "Assert(reln->md_seg_fds[forknum] == NULL)" within
the "else if (reln->md_num_open_segs[forknum] == 0)" path here, prior
to the MemoryContextAlloc(). If it's worth setting it to NULL when
there are 0 segs (i.e., "reln->md_seg_fds[forknum] = NULL"), then
perhaps it's worth checking that things are found that way later.

I guess that this is at odds with remarks in my last mail about
considering geometric allocations for the "reln->md_seg_fds[forknum]"
array. Both feedback items are more or less just things that bothered
me ever so slightly, which I don't necessarily expect you to act on,
but assume you'd like to hear.

-- 
Peter Geoghegan



Re: _mdfd_getseg can be expensive

From: Andres Freund
On 2016-08-31 14:09:47 -0700, Peter Geoghegan wrote:
> On Thu, Aug 18, 2016 at 5:26 PM, Andres Freund <andres@anarazel.de> wrote:
> > Rebased version attached. A review would be welcome. Plan to push this
> > forward otherwise in the not too far away future.
> 
> This looks good.

Thanks for looking!


> The only thing that stuck out to any degree is that we don't grow the
> "reln->md_seg_fds[forknum]" array within the new _fdvec_resize()
> function geometrically. In other words, we don't separately keep track
> of the array size and allocated size, and grow the allocated size
> using the doubling strategy that you see in many places.

I can't really see that being worth the code here. The cost of
open()/lseek()'ing etc. is going to dwarf the cost of this.


Regards,

Andres



Re: _mdfd_getseg can be expensive

From: Peter Geoghegan
On Wed, Aug 31, 2016 at 2:37 PM, Andres Freund <andres@anarazel.de> wrote:
>> This looks good.
>
> Thanks for looking!

No problem.

In other painfully pedantic news, I should point out that
sizeof(size_t) isn't necessarily word size (the most generic
definition of word size for the architecture), contrary to my reading
of the 0002-* patch comments. I'm mostly thinking about x86_64
here, of course.


-- 
Peter Geoghegan



Re: _mdfd_getseg can be expensive

From: Andres Freund

On August 31, 2016 3:06:23 PM PDT, Peter Geoghegan <pg@heroku.com> wrote:

>In other painfully pedantic news, I should point out that
>sizeof(size_t) isn't necessarily word size (the most generic
>definition of word size for the architecture), contrary to my reading
>of the 0002-* patch comments. I'm mostly thinking about x86_64
>here, of course.

Uh?
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: _mdfd_getseg can be expensive

From: Peter Geoghegan
On Wed, Aug 31, 2016 at 3:08 PM, Andres Freund <andres@anarazel.de> wrote:
> On August 31, 2016 3:06:23 PM PDT, Peter Geoghegan <pg@heroku.com> wrote:
>
>>In other painfully pedantic news, I should point out that
>>sizeof(size_t) isn't necessarily word size (the most generic
>>definition of word size for the architecture), contrary to my reading
>>of the 0002-* patch comments. I'm mostly thinking about x86_64
>>here, of course.
>
> Uh?

Sorry, I really should not have said anything. It is true that x86_64
word size is sometimes reported as 16 and/or 32 bits [1], because of
legacy issues.

[1] https://en.wikipedia.org/wiki/Word_(computer_architecture)#Table_of_word_sizes
-- 
Peter Geoghegan



Re: _mdfd_getseg can be expensive

From: Andres Freund
On 2016-08-31 15:15:16 -0700, Peter Geoghegan wrote:
> On Wed, Aug 31, 2016 at 3:08 PM, Andres Freund <andres@anarazel.de> wrote:
> > On August 31, 2016 3:06:23 PM PDT, Peter Geoghegan <pg@heroku.com> wrote:
> >
> >>In other painfully pedantic news, I should point out that
> >>sizeof(size_t) isn't necessarily word size (the most generic
> >>definition of word size for the architecture), contrary to my reading
> >>of the 0002-* patch comments. I'm mostly thinking about x86_64
> >>here, of course.
> >
> > Uh?
> 
> Sorry, I really should have not said anything. It is true that x86_64
> word size is sometimes reported as 16 and/or 32 bits [1], because of
> legacy issues.

I think native word size describes the issue well enough. And, more
importantly, I can't think of an equally short but more accurate
description.

I've pushed the patches.  Thanks for the review.

Andres