Thread: avoiding tuple copying in btree index builds
Hi,

Today, I discovered that when building a btree index, the btree code uses index_form_tuple() to create an index tuple from the heap tuple, calls tuplesort_putindextuple() to copy that tuple into the sort's memory context, and then frees the original one it built. This seemed inefficient, so I wrote a patch to eliminate the tuple copying. It works by adding a function, tuplesort_putindextuplevalues(), which builds the tuple directly in the sort's memory context and thus avoids the need for a separate copy. I'm not sure if that's the best approach, but the optimization seems worthwhile.

I tested it by repeatedly executing "REINDEX INDEX pgbench_accounts_pkey" on a PPC64 machine. pgbench_accounts contains 10 million records. With unpatched master as of b2f7bd72c4d3e80065725c72e85778d5f4bdfd4a, I got times of 6.159s, 6.177s, and 6.201s. With the attached patch, I got times of 5.787s, 5.972s, and 5.913s, a savings of almost 5%. Not bad considering the amount of work involved.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
Hi,

On 2014-05-05 13:52:39 -0400, Robert Haas wrote:
> Today, I discovered that when building a btree index, the btree code
> uses index_form_tuple() to create an index tuple from the heap tuple,
> calls tuplesort_putindextuple() to copy that tuple into the sort's
> memory context, and then frees the original one it built. This seemed
> inefficient, so I wrote a patch to eliminate the tuple copying. It
> works by adding a function tuplesort_putindextuplevalues(), which
> builds the tuple in the sort's memory context and thus avoids the need
> for a separate copy. I'm not sure if that's the best approach, but
> the optimization seems worthwhile.

Hm. It looks like we could quite easily just get rid of tuplesort_putindextuple(). The hash usage doesn't look hard to convert.

> I tested it by repeatedly executing "REINDEX INDEX
> pgbench_accounts_pkey" on a PPC64 machine. pgbench_accounts contains
> 10 million records. With unpatched master as of
> b2f7bd72c4d3e80065725c72e85778d5f4bdfd4a, I got times of 6.159s,
> 6.177s, and 6.201s. With the attached patch, I got times of 5.787s,
> 5.972s, and 5.913s, a savings of almost 5%. Not bad considering the
> amount of work involved.

Yes, that's certainly worthwhile. Nice.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, May 5, 2014 at 2:13 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-05-05 13:52:39 -0400, Robert Haas wrote:
>> [...]
>
> Hm. It looks like we could quite easily just get rid of
> tuplesort_putindextuple(). The hash usage doesn't look hard to convert.

I glanced at that, but it wasn't obvious to me how to convert the hash usage. If you have an idea, I'm all ears.

> Yes, that's certainly worthwhile. Nice.

Thanks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> On Mon, May 5, 2014 at 2:13 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2014-05-05 13:52:39 -0400, Robert Haas wrote:
> >> Today, I discovered that when building a btree index, the btree code
> >> uses index_form_tuple() to create an index tuple from the heap tuple,
> >> calls tuplesort_putindextuple() to copy that tuple into the sort's
> >> memory context, and then frees the original one it built. This seemed
> >> inefficient, so I wrote a patch to eliminate the tuple copying. It
> >> works by adding a function tuplesort_putindextuplevalues(), which
> >> builds the tuple in the sort's memory context and thus avoids the need
> >> for a separate copy. I'm not sure if that's the best approach, but
> >> the optimization seems worthwhile.
> >
> > Hm. It looks like we could quite easily just get rid of
> > tuplesort_putindextuple(). The hash usage doesn't look hard to convert.
>
> I glanced at that, but it wasn't obvious to me how to convert the hash
> usage. If you have an idea, I'm all ears.
I also think it's possible to have a similar optimization for hash indexes, in case it has to spool the tuple for sorting.

In function hashbuildCallback(), when buildstate->spool is true, we can avoid forming the index tuple. To check for nulls before calling _h_spool(), we can traverse the isnull array.
On Sun, Jun 1, 2014 at 3:26 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, May 6, 2014 at 12:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> [...]
>>
>> I glanced at that, but it wasn't obvious to me how to convert the hash
>> usage. If you have an idea, I'm all ears.
>
> I also think it's possible to have a similar optimization for hash
> indexes, in case it has to spool the tuple for sorting.
>
> In function hashbuildCallback(), when buildstate->spool is true, we
> can avoid forming the index tuple. To check for nulls before calling
> _h_spool(), we can traverse the isnull array.

Hmm, that might work. Arguably it's less efficient, but on the other hand, if it avoids forming the tuple sometimes, it might be MORE efficient. And anyway, the difference might not be enough to matter.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> On Sun, Jun 1, 2014 at 3:26 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I also think it's possible to have a similar optimization for hash
> > indexes, in case it has to spool the tuple for sorting.
> >
> > In function hashbuildCallback(), when buildstate->spool is true, we
> > can avoid forming the index tuple. To check for nulls before calling
> >
> > _h_spool(), we can traverse the isnull array.
>
> Hmm, that might work. Arguably it's less efficient, but on the other
> hand if it avoids forming the tuple sometimes it might be MORE
> efficient. And anyway the difference might not be enough to matter.
For hash indexes, this optimization would also make it possible to remove tuplesort_putindextuple() from the code entirely, if the optimization is done for hash indexes as well. However, we can leave that as it is for now, as there doesn't seem to be any gain by doing so.

You seem to have removed the puttuple_common() call from tuplesort_putindextuple(); is there any reason for doing so?

Apart from that, the patch looks good to me.
The performance data below, for various index sizes, shows that this patch can improve performance by up to ~7%. The difference becomes smaller as the index gets very large; I think that's because all of the data has to be written out at the end of the operation, so for big data sets the improvement is hidden by the large amount of I/O. The improvement figures are based on the median of 3 runs, with times measured using psql's \timing option.
Performance Data
---------------------------
Configuration:
IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
shared_buffers=8GB
pgbench_accounts
No. of records = 10 million
Index - on integer
Operation - Reindex
Master
16043.500 ms
16058.723 ms
15941.057 ms
Patch
15525.054 ms
15551.935 ms
15492.879 ms
Performance Improvement: 3.43%
pgbench_accounts
No. of records = 30 million
Index - on integer
Operation - Reindex
Master
51258.338 ms
50520.328 ms
50562.022 ms
Patch
49610.710 ms
49302.571 ms
49301.390 ms
Performance Improvement: 2.41%
table (c1 int, c2 char(10))
No. of records = 300,000
Index - on char(10)
Operation - Reindex
Master
443.584 ms
444.798 ms
452.888 ms
Patch
421.554 ms
430.528 ms
447.558 ms
Performance Improvement: 3.2%
table (c1 int, c2 char(10))
No. of records = 500,000
Index - on char(10)
Operation - Reindex
Master
663.621 ms
661.299 ms
657.754 ms
Patch
652.325 ms
644.782 ms
643.218 ms
Performance Improvement: 2.5%
table (c1 int, c2 char(10))
No. of records = 1,000,000
Index - on char(10)
Operation - Reindex
Master
16554.076 ms
16686.528 ms
16571.129 ms
Patch
16556.852 ms
16513.543 ms
16610.615 ms
Performance Improvement: less than 1%
table (c1 int, c2 char(20))
No. of records = 300,000
Index - on char(20)
Operation - Reindex
Master
429.670 ms
441.445 ms
411.539 ms
Patch
401.801 ms
412.716 ms
395.002 ms
Performance Improvement: 6.48%
table (c1 int, c2 char(20))
No. of records = 500,000
Index - on char(20)
Operation - Reindex
Master
724.541 ms
731.582 ms
704.934 ms
Patch
686.004 ms
677.361 ms
686.915 ms
Performance Improvement: 5.31%
table (c1 int, c2 char(20))
No. of records = 1,000,000
Index - on char(20)
Operation - Reindex
Master
20728.758 ms
20665.289 ms
20656.375 ms
Patch
20594.022 ms
20617.383 ms
20628.181 ms
Performance Improvement: less than 1%
On Tue, Jun 3, 2014 at 4:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Jun 1, 2014 at 3:26 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> [...]
>>
>> In function hashbuildCallback(), when buildstate->spool is true, we
>> can avoid forming the index tuple. To check for nulls before calling
>> _h_spool(), we can traverse the isnull array.
>
> Hmm, that might work. Arguably it's less efficient, but on the other
> hand if it avoids forming the tuple sometimes it might be MORE
> efficient. And anyway the difference might not be enough to matter.

On further review, this is definitely the way to go: it's a straight-up win. The isnull array is never more than one element in length, so testing the single element is quite trivial. The attached, revised patch provides a modest but useful speedup for both hash and btree index builds.

Anyone see any reason NOT to do this? So far it looks like a slam-dunk from here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
Robert Haas <robertmhaas@gmail.com> writes:
> On further review, this is definitely the way to go: it's a
> straight-up win. The isnull array is never more than one element in
> length, so testing the single element is quite trivial. The
> attached, revised patch provides a modest but useful speedup for both
> hash and btree index builds.
>
> Anyone see any reason NOT to do this? So far it looks like a
> slam-dunk from here.

If index_form_tuple leaks any memory in the sort context, this would be not so much a slam-dunk win as a complete disaster. I'm not sure that no-leak is a safe assumption, especially in cases where we do toast compression of the index datums. (In fact, I'm pretty sure that the reason it was coded like this originally was exactly to avoid that assumption.)

Assuming that you inspect the code and determine it's safe, the patch had better decorate index_form_tuple with dire warnings about not leaking memory; because even if it's safe today, somebody might break it tomorrow.

On a micro-optimization level, it might be worth passing the TID as ItemPointer not ItemPointerData (ie, pass a pointer until we get to the point of actually inserting the TID into the index tuple). I'm not sure that copying odd-size structs should be assumed to be efficient.

			regards, tom lane
On Mon, Jun 16, 2014 at 8:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> If index_form_tuple leaks any memory in the sort context, this would be
> not so much a slam-dunk win as a complete disaster. I'm not sure that
> no-leak is a safe assumption, especially in cases where we do toast
> compression of the index datums. (In fact, I'm pretty sure that the
> reason it was coded like this originally was exactly to avoid that
> assumption.)

Yes, that would be bad. The reason I think it's probably OK is that index_form_tuple seems to have had, from ancient times (cf. f67e79045), code that goes to non-trivial lengths to free any memory that's allocated as a result of de-TOASTing, which we presumably would not bother with if we were counting on memory context resets to clean things up. It calls toast_compress_datum(), which is similarly careful. The most complicated code path appears to be the VARATT_IS_EXTERNAL() case, where we call heap_tuple_fetch_attr(), which calls toast_fetch_datum(). I'm kind of wondering whether that can really happen, though, because I didn't think indexes could use TOAST storage. Even if it does happen, I don't see any obvious reason why it should leak, but it'd be harder to be certain that it doesn't.

> Assuming that you inspect the code and determine it's safe, the patch
> had better decorate index_form_tuple with dire warnings about not leaking
> memory; because even if it's safe today, somebody might break it tomorrow.
>
> On a micro-optimization level, it might be worth passing the TID as
> ItemPointer not ItemPointerData (ie, pass a pointer until we get to
> the point of actually inserting the TID into the index tuple).
> I'm not sure that copying odd-size structs should be assumed to be
> efficient.

Yeah, true. Checking existing precedent, it looks like we usually pass ItemPointer rather than ItemPointerData, so it's probably a good idea to do this that way too, for reasons of style if nothing else. I kind of wonder whether it's really more efficient to pass an 8-byte pointer to a 6-byte structure than to just pass the structure itself, but it might be.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> Yeah, true. Checking existing precedent, it looks like we usually
> pass ItemPointer rather than ItemPointerData, so it's probably a good
> idea to do this that way too for reasons of style if nothing else. I
> kind of wonder whether it's really more efficient to pass an 8-byte
> pointer to a 6-byte structure than to just pass the structure itself,
> but it might be.

The pointer will certainly be passed in a register, or whatever passes for registers on the particular machine architecture. Weird-size structs, though, tend to have arcane and not-so-efficient rules for being passed by value. It's not unlikely that what the compiler will do under the hood is pass a pointer anyhow, and then do a memcpy to make a local copy in the called function.

			regards, tom lane
On Tue, Jun 17, 2014 at 10:08 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The pointer will certainly be passed in a register, or whatever passes for
> registers on the particular machine architecture. Weird-size structs,
> though, tend to have arcane and not-so-efficient rules for being passed
> by value. It's not unlikely that what the compiler will do under the hood
> is pass a pointer anyhow, and then do a memcpy to make a local copy in
> the called function.

OK, committed with the suggested changes. (Thanks to Abhijit for pinging me about this, and more generally for his active and effective management of this CommitFest!)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company