Thread: Re: batch update query performance

Re: batch update query performance

From: Hans Drexler
Dear Jeff, Albe and Heikki,

Let me start by thanking you for your time. It is really nice to have a
real supportive community. Thank you.

After reading the answers, we decided to run an experiment with a
fillfactor of 40% and the index on the is_grc_002 field dropped (but
retaining the other indexes). The experiment showed a reduction in
run-time to ~125 seconds, almost 25 times faster than before. We
are now doing more tests to verify this fix. We will send a SOLVED
message when the fix is verified (unless you would rather we not bother...)

We think we understand why the improvement works. Let me state our
understanding here. Please comment if we got it wrong.

Index entries point to record pages. An update on a row results in a new
row instance. If the new instance can be written in the same page as the
old instance, then no indexes need to be updated because the index still
points to the correct page. (Unless the update itself modifies an
indexed value). By specifying a fillfactor of 40%, there will be room
for an updated version of each row in the page.
We assume (sorry) that vacuuming the table will release the space of the
old rows, so that we can again do an update query and reuse the freed up
space in the pages.
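For reference, the setup described above can be applied roughly as follows (a sketch using the table name from this thread; note that a changed fillfactor only affects newly written pages, so the table must be rewritten once for it to take effect everywhere):

```sql
-- Reserve 60% of each heap page for updated row versions, so an UPDATE can
-- usually place the new version on the same page (a HOT update).
ALTER TABLE t67cdi_nl_cmp_descr SET (fillfactor = 40);

-- Rewrite the table so existing pages honor the new fillfactor.
-- Caution: VACUUM FULL takes an exclusive lock on the table.
VACUUM FULL t67cdi_nl_cmp_descr;

-- After each batch update, reclaim dead row versions so the freed space
-- inside the pages can be reused by the next cycle's update.
VACUUM t67cdi_nl_cmp_descr;
```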


Jeff, answering your question: The update is done after each cycle. It
will actually also update rows that were already updated before. We
realize this is actually wasteful.


So we might change

update t67cdi_nl_cmp_descr set is_grc_002='Y'

to

update t67cdi_nl_cmp_descr set is_grc_002='Y' where is_grc_002 is null

This will avoid creating new row versions for records that were already
changed before, which might give us an additional speed improvement.


Albe, answering your question: yes, the update did indeed finish in ~3
minutes when all indexes were dropped. Fifteen indexes is indeed a large
number. These indexes are configured by consultants on a
project-by-project basis; they are not hard-coded in the software. I
will, however, advise the consultant to take a critical look at the
large number of indexes used in this case.

Heikki, Thank you for the nifty techniques. I especially like the
possible solution with partitioning the table and using a view. We don't
think we can do this at this point. Let me elaborate a bit to explain
why.

The application works in cycles. Each cycle adds more records to the
table. New records have a NULL value in the field is_grc_002. At the end
of the cycle, the value of all existing records is changed to 'Y'.
During the cycle, some records have is_grc_002 NULL and others have the
value 'Y'. This is used for the processing. A view over two tables can
only work if all rows in each table share a single value. We could use
two tables (one with records from old cycles, one with records from the
new cycle) and a view, but that would mean copying all records from the
"new" to the "old" table in each cycle.
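For concreteness, the rejected two-table design would look roughly like this (a hypothetical sketch; table names are invented, and it assumes the is_grc_002 column is removed from the base tables and synthesized by the view instead):

```sql
-- Two base tables: one for finished cycles, one for the current cycle.
CREATE TABLE cmp_descr_old (id bigint PRIMARY KEY /* , ... other columns */);
CREATE TABLE cmp_descr_new (id bigint PRIMARY KEY /* , ... other columns */);

-- The view synthesizes the flag as a constant per table,
-- so no UPDATE of is_grc_002 is ever needed.
CREATE VIEW cmp_descr AS
    SELECT id, 'Y'::text  AS is_grc_002 FROM cmp_descr_old
    UNION ALL
    SELECT id, NULL::text AS is_grc_002 FROM cmp_descr_new;

-- But the end-of-cycle "update" becomes a bulk move of every new row,
-- which is exactly the copying cost described above.
INSERT INTO cmp_descr_old SELECT id FROM cmp_descr_new;
TRUNCATE cmp_descr_new;
```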

Kind regards,

Hans Drexler





Re: batch update query performance

From: Jeff Janes

On Tue, Apr 8, 2014 at 6:21 AM, Hans Drexler <Hans.Drexler@humaninference.com> wrote:
Dear Jeff, Albe and Heikki,

Let me start by thanking you for your time. It is really nice to have a
real supportive community. Thank you.

After reading the answers, we decided to do an experiment with a
fillfactor of 40% and dropping the index on the is_grc_002 field (but
retaining the other indexes.) The experiment showed a reduction in
run-time to ~125 seconds. That is almost 25 times faster than it was. We
are now doing more tests to verify this fix. We will send a SOLVED
message when the fix is verified (unless you state to not bother...)

We think we understand why the improvement works. Let me state our
understanding here. Please comment if we got it wrong.

Index entries point to record pages. An update on a row results in a new
row instance. If the new instance can be written in the same page as the
old instance, then no indexes need to be updated because the index still
points to the correct page. (Unless the update itself modifies an
indexed value). By specifying a fillfactor of 40%, there will be room
for an updated version of each row in the page.

This is mostly correct.  The index entry does not have *just* a page, it also has an offset to a slot on that page.  However, once it gets to the page there is a mechanism for chaining slots together, so you can still find the new version given the slot of an older version on the same page.  (If there were a way to have the index store *just* the page, then it would be even more useful for HOT, as then only the indexes for the values actually changed would need to get updates, as opposed to now where every index needs to be updated if any index needs to be updated.  But that would have other trade-offs)
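One way to check whether the HOT mechanism described here is actually kicking in is to watch the standard statistics view (a sketch; run before and after a batch update and compare):

```sql
-- n_tup_hot_upd counts updates that stayed on the same page and did not
-- have to touch any index; compare it with the total update count.
SELECT n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 't67cdi_nl_cmp_descr';
```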

 
We assume (sorry) that vacuuming the table will release the space of the
old rows, so that we can again do an update query and reuse the freed up
space in the pages.

Once no transaction can possibly be interested in the old version, then a vacuum can free it up for reuse.  In the special case of old HOT-updated tuples, any other process that happens to visit the page can also clean them up once they are old enough, not just vacuums.  

But if someone has a long running transaction open, even if that transaction never has and never will touch the table being vacuumed, it will still prevent the space from being reused.
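A quick way to look for such long-running transactions is the standard activity view (a sketch):

```sql
-- Sessions with the oldest open transactions; any of these can hold back
-- dead-row cleanup across the whole database, not just tables they touch.
SELECT pid, xact_start, state, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 5;
```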
 


Jeff, answering your question: The update is done after each cycle. It
will actually also update rows that were already updated before. We
realize this is actually wasteful.


So we might change

update t67cdi_nl_cmp_descr set is_grc_002='Y'

to

update t67cdi_nl_cmp_descr set is_grc_002='Y' where is_grc_002 is null

This will avoid creating new row versions for records that were already
changed before, which might give us an additional speed improvement.


That will probably help a lot. The HOT update code is smart enough to realize that changing from 'Y' to 'Y' does not prevent a HOT update, but it still needs to find room on the same page for a new copy of the tuple, or else it cannot use the HOT mechanism anyway. Once you add this restriction to the WHERE clause, you might find that it is better to leave the index in place and put up with the index updates for those rows which actually do need to be updated, rather than keep dropping and rebuilding the index. It would depend on what proportion of the table is getting updated each time.

Cheers,

Jeff