Thread: avoiding tuple copying in btree index builds
Hi,

Today, I discovered that when building a btree index, the btree code uses index_form_tuple() to create an index tuple from the heap tuple, calls tuplesort_putindextuple() to copy that tuple into the sort's memory context, and then frees the original one it built. This seemed inefficient, so I wrote a patch to eliminate the tuple copying. It works by adding a function, tuplesort_putindextuplevalues(), which builds the tuple directly in the sort's memory context and thus avoids the need for a separate copy. I'm not sure if that's the best approach, but the optimization seems worthwhile.

I tested it by repeatedly executing "REINDEX INDEX pgbench_accounts_pkey" on a PPC64 machine. pgbench_accounts contains 10 million records. With unpatched master as of b2f7bd72c4d3e80065725c72e85778d5f4bdfd4a, I got times of 6.159s, 6.177s, and 6.201s. With the attached patch, I got times of 5.787s, 5.972s, and 5.913s, a savings of almost 5%. Not bad considering the amount of work involved.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
Hi,

On 2014-05-05 13:52:39 -0400, Robert Haas wrote:
> Today, I discovered that when building a btree index, the btree code
> uses index_form_tuple() to create an index tuple from the heap tuple,
> calls tuplesort_putindextuple() to copy that tuple into the sort's
> memory context, and then frees the original one it built. This seemed
> inefficient, so I wrote a patch to eliminate the tuple copying. It
> works by adding a function tuplesort_putindextuplevalues(), which
> builds the tuple in the sort's memory context and thus avoids the need
> for a separate copy. I'm not sure if that's the best approach, but
> the optimization seems worthwhile.

Hm. It looks like we could quite easily just get rid of tuplesort_putindextuple(). The hash usage doesn't look hard to convert.

> I tested it by repeatedly executing "REINDEX INDEX
> pgbench_accounts_pkey" on a PPC64 machine. pgbench_accounts contains
> 10 million records. With unpatched master as of
> b2f7bd72c4d3e80065725c72e85778d5f4bdfd4a, I got times of 6.159s,
> 6.177s, and 6.201s. With the attached patch, I got times of 5.787s,
> 5.972s, and 5.913s, a savings of almost 5%. Not bad considering the
> amount of work involved.

Yes, that's certainly worthwhile. Nice.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, May 5, 2014 at 2:13 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-05-05 13:52:39 -0400, Robert Haas wrote:
>> [...]
>
> Hm. It looks like we could quite easily just get rid of
> tuplesort_putindextuple(). The hash usage doesn't look hard to convert.

I glanced at that, but it wasn't obvious to me how to convert the hash usage. If you have an idea, I'm all ears.

> Yes, that's certainly worthwhile. Nice.

Thanks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> On Mon, May 5, 2014 at 2:13 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2014-05-05 13:52:39 -0400, Robert Haas wrote:
> >> Today, I discovered that when building a btree index, the btree code
> >> uses index_form_tuple() to create an index tuple from the heap tuple,
> >> calls tuplesort_putindextuple() to copy that tuple into the sort's
> >> memory context, and then frees the original one it built. This seemed
> >> inefficient, so I wrote a patch to eliminate the tuple copying. It
> >> works by adding a function tuplesort_putindextuplevalues(), which
> >> builds the tuple in the sort's memory context and thus avoids the need
> >> for a separate copy. I'm not sure if that's the best approach, but
> >> the optimization seems worthwhile.
> >
> > Hm. It looks like we could quite easily just get rid of
> > tuplesort_putindextuple(). The hash usage doesn't look hard to convert.
>
> I glanced at that, but it wasn't obvious to me how to convert the hash
> usage. If you have an idea, I'm all ears.
I also think it's possible to have a similar optimization for hash indexes, in case it has to spool the tuple for sorting.

In function hashbuildCallback(), when buildstate->spool is true, we can avoid forming the index tuple. To check for nulls before calling _h_spool(), we can traverse the isnull array.
On Sun, Jun 1, 2014 at 3:26 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, May 6, 2014 at 12:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> [...]
>>
>> I glanced at that, but it wasn't obvious to me how to convert the hash
>> usage. If you have an idea, I'm all ears.
>
> I also think it's possible to have a similar optimization for hash
> indexes, in case it has to spool the tuple for sorting.
>
> In function hashbuildCallback(), when buildstate->spool is true, we
> can avoid forming the index tuple. To check for nulls before calling
> _h_spool(), we can traverse the isnull array.

Hmm, that might work. Arguably it's less efficient, but on the other hand, if it avoids forming the tuple sometimes, it might be MORE efficient. And anyway, the difference might not be enough to matter.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> On Sun, Jun 1, 2014 at 3:26 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I also think it's possible to have a similar optimization for hash
> > indexes, in case it has to spool the tuple for sorting.
> >
> > In function hashbuildCallback(), when buildstate->spool is true, we
> > can avoid forming the index tuple. To check for nulls before calling
> >
> > _h_spool(), we can traverse the isnull array.
>
> Hmm, that might work. Arguably it's less efficient, but on the other
> hand if it avoids forming the tuple sometimes it might be MORE
> efficient. And anyway the difference might not be enough to matter.
For hash indexes, this optimization would also make it possible to remove tuplesort_putindextuple() from the code entirely, if the optimization is done for hash indexes as well. However, we can leave that as it is for now, as there doesn't seem to be any gain by doing so.

You seem to have removed the puttuple_common() call from tuplesort_putindextuple(); is there any reason for doing so?

Apart from that, the patch looks good to me.
The performance data below, for various index sizes, shows that this patch can improve performance by up to ~7%. The difference becomes smaller as the index gets very large; I think that's because all of the data has to be written out at the end of the operation, so for big data sets the improvement is hidden by the large amount of I/O. The improvement figures are based on the median of 3 runs, with times measured using psql's \timing option.
Performance Data
---------------------------
Configuration:
IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
shared_buffers=8GB
pgbench_accounts
No. of records = 10 million
Index - on integer
Operation - Reindex
Master
16043.500 ms
16058.723 ms
15941.057 ms
Patch
15525.054 ms
15551.935 ms
15492.879 ms
Performance Improvement: 3.43%
pgbench_accounts
No. of records = 30 million
Index - on integer
Operation - Reindex
Master
51258.338 ms
50520.328 ms
50562.022 ms
Patch
49610.710 ms
49302.571 ms
49301.390 ms
Performance Improvement: 2.41%
table (c1 int, c2 char(10))
No. of records = 300,000
Index - on char(10)
Operation - Reindex
Master
443.584 ms
444.798 ms
452.888 ms
Patch
421.554 ms
430.528 ms
447.558 ms
Performance Improvement: 3.2%
table (c1 int, c2 char(10))
No. of records = 500,000
Index - on char(10)
Operation - Reindex
Master
663.621 ms
661.299 ms
657.754 ms
Patch
652.325 ms
644.782 ms
643.218 ms
Performance Improvement: 2.5%
table (c1 int, c2 char(10))
No. of records = 1,000,000
Index - on char(10)
Operation - Reindex
Master
16554.076 ms
16686.528 ms
16571.129 ms
Patch
16556.852 ms
16513.543 ms
16610.615 ms
Performance Improvement: less than 1%
table (c1 int, c2 char(20))
No. of records = 300,000
Index - on char(20)
Operation - Reindex
Master
429.670 ms
441.445 ms
411.539 ms
Patch
401.801 ms
412.716 ms
395.002 ms
Performance Improvement: 6.48%
table (c1 int, c2 char(20))
No. of records = 500,000
Index - on char(20)
Operation - Reindex
Master
724.541 ms
731.582 ms
704.934 ms
Patch
686.004 ms
677.361 ms
686.915 ms
Performance Improvement: 5.31%
table (c1 int, c2 char(20))
No. of records = 1,000,000
Index - on char(20)
Operation - Reindex
Master
20728.758 ms
20665.289 ms
20656.375 ms
Patch
20594.022 ms
20617.383 ms
20628.181 ms
Performance Improvement: less than 1%
On Tue, Jun 3, 2014 at 4:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Jun 1, 2014 at 3:26 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> [...]
>>
>> In function hashbuildCallback(), when buildstate->spool is true, we
>> can avoid forming the index tuple. To check for nulls before calling
>> _h_spool(), we can traverse the isnull array.
>
> Hmm, that might work. Arguably it's less efficient, but on the other
> hand if it avoids forming the tuple sometimes it might be MORE
> efficient. And anyway the difference might not be enough to matter.

On further review, this is definitely the way to go: it's a straight-up win. The isnull array is never more than one element in length, so testing the single element is quite trivial. The attached, revised patch provides a modest but useful speedup for both hash and btree index builds.

Anyone see any reason NOT to do this? So far it looks like a slam-dunk from here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
Robert Haas <robertmhaas@gmail.com> writes:
> On further review, this is definitely the way to go: it's a
> straight-up win. The isnull array is never more than one element in
> length, so testing the single element is quite trivial. The
> attached, revised patch provides a modest but useful speedup for both
> hash and btree index builds.
>
> Anyone see any reason NOT to do this? So far it looks like a
> slam-dunk from here.

If index_form_tuple leaks any memory in the sort context, this would be not so much a slam-dunk win as a complete disaster. I'm not sure that no-leak is a safe assumption, especially in cases where we do toast compression of the index datums. (In fact, I'm pretty sure that the reason it was coded like this originally was exactly to avoid that assumption.)

Assuming that you inspect the code and determine it's safe, the patch had better decorate index_form_tuple with dire warnings about not leaking memory; because even if it's safe today, somebody might break it tomorrow.

On a micro-optimization level, it might be worth passing the TID as ItemPointer not ItemPointerData (ie, pass a pointer until we get to the point of actually inserting the TID into the index tuple). I'm not sure that copying odd-size structs should be assumed to be efficient.

			regards, tom lane
On Mon, Jun 16, 2014 at 8:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> If index_form_tuple leaks any memory in the sort context, this would be
> not so much a slam-dunk win as a complete disaster. I'm not sure that
> no-leak is a safe assumption, especially in cases where we do toast
> compression of the index datums. (In fact, I'm pretty sure that the
> reason it was coded like this originally was exactly to avoid that
> assumption.)

Yes, that would be bad. The reason I think it's probably OK is that index_form_tuple seems to have had, from ancient times (cf. f67e79045), code that goes to non-trivial lengths to free any memory that's allocated as a result of de-TOASTing, which we presumably would not bother with if we were counting on memory context resets to clean things up. It calls toast_compress_datum(), which is similarly careful. The most complicated code path appears to be the VARATT_IS_EXTERNAL() case, where we call heap_tuple_fetch_attr(), which calls toast_fetch_datum(). I'm kind of wondering whether that can really happen, though, because I didn't think indexes could use TOAST storage. Even if it does happen, I don't see any obvious reason why it should leak, but it'd be harder to be certain that it doesn't.

> Assuming that you inspect the code and determine it's safe, the patch
> had better decorate index_form_tuple with dire warnings about not leaking
> memory; because even if it's safe today, somebody might break it tomorrow.
>
> On a micro-optimization level, it might be worth passing the TID as
> ItemPointer not ItemPointerData (ie, pass a pointer until we get to
> the point of actually inserting the TID into the index tuple).
> I'm not sure that copying odd-size structs should be assumed to be
> efficient.

Yeah, true. Checking existing precedent, it looks like we usually pass ItemPointer rather than ItemPointerData, so it's probably a good idea to do this that way too, for reasons of style if nothing else. I kind of wonder whether it's really more efficient to pass an 8-byte pointer to a 6-byte structure than to just pass the structure itself, but it might be.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> Yeah, true. Checking existing precedent, it looks like we usually
> pass ItemPointer rather than ItemPointerData, so it's probably a good
> idea to do this that way too for reasons of style if nothing else. I
> kind of wonder whether it's really more efficient to pass an 8-byte
> pointer to a 6-byte structure than to just pass the structure itself,
> but it might be.

The pointer will certainly be passed in a register, or whatever passes for registers on the particular machine architecture. Weird-size structs, though, tend to have arcane and not-so-efficient rules for being passed by value. It's not unlikely that what the compiler will do under the hood is pass a pointer anyhow, and then do a memcpy to make a local copy in the called function.

			regards, tom lane
On Tue, Jun 17, 2014 at 10:08 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The pointer will certainly be passed in a register, or whatever passes for
> registers on the particular machine architecture. Weird-size structs,
> though, tend to have arcane and not-so-efficient rules for being passed
> by value. It's not unlikely that what the compiler will do under the hood
> is pass a pointer anyhow, and then do a memcpy to make a local copy in
> the called function.

OK, committed with the suggested changes. (Thanks to Abhijit for pinging me about this, and more generally for his active and effective management of this CommitFest!)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company