Thread: Improving the Performance of Full Table Updates

Improving the Performance of Full Table Updates

From
"Gokulakannan Somsundaram"
Date:
Hi,<br />   The Current architecture of Updates in PostgreSQL is <br />1) Make a select query out of update. It
involvesa READ lock/BUFFER_SHARE<br />2) Get the tupleid<br />3) Goto the buffer containing the tupleid, make a
BUFFER_EXCLUSIVElock on it <br />4) update it<br />5) Repeat the above process for subsequent rows<br /><br />I propose
tochange this row-by-row approach, when it is a full table update. I plan to send a extra flag(which will be set for
Fulltable Deletes/Updates). this would make the access method directly acquire the exclusive lock and update the
existingrecord. <br /><br />For Deletes this is simple. But for updates, the projection tuple has to be made before
re-insertingit. So there will be a list of Heap tuples stored in memory for each page getting updated. these tuples
willbe inserted after the deletion part of update is done. This is just a rough design. I may get involved in a detail
designonce i get a nod from the mailing list community. <br /><br />Thanks,<br />Gokul.<br /> 

Re: Improving the Performance of Full Table Updates

From
"Heikki Linnakangas"
Date:
Gokulakannan Somsundaram wrote:
> Hi,
>    The Current architecture of Updates in PostgreSQL is
> 1) Make a select query out of update. It involves a READ lock/BUFFER_SHARE
> 2) Get the tupleid
> 3) Goto the buffer containing the tupleid, make a BUFFER_EXCLUSIVE lock on
> it
> 4) update it
> 5) Repeat the above process for subsequent rows
> 
> I propose to change this row-by-row approach, when it is a full table
> update. I plan to send a extra flag(which will be set for Full table
> Deletes/Updates). this would make the access method directly acquire the
> exclusive lock and update the existing record.
> 
> For Deletes this is simple. But for updates, the projection tuple has to be
> made before re-inserting it. So there will be a list of Heap tuples stored
> in memory for each page getting updated. these tuples will be inserted after
> the deletion part of update is done. This is just a rough design. I may get
> involved in a detail design once i get a nod from the mailing list
> community.

I doubt the locking overhead is that significant. Have you done any
profiling to show that it's worth it?

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Improving the Performance of Full Table Updates

From
"Gokulakannan Somsundaram"
Date:
The obvious advantages are
a) Avoidance of one read lock per page
b) One Big write lock instead of multiple write locks.

But as you said, i will do some initial profiling and get back.

Thanks,
Gokul.

On 9/20/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
Gokulakannan Somsundaram wrote:
> Hi,
>    The Current architecture of Updates in PostgreSQL is
> 1) Make a select query out of update. It involves a READ lock/BUFFER_SHARE
> 2) Get the tupleid
> 3) Goto the buffer containing the tupleid, make a BUFFER_EXCLUSIVE lock on
> it
> 4) update it
> 5) Repeat the above process for subsequent rows
>
> I propose to change this row-by-row approach, when it is a full table
> update. I plan to send a extra flag(which will be set for Full table
> Deletes/Updates). this would make the access method directly acquire the
> exclusive lock and update the existing record.
>
> For Deletes this is simple. But for updates, the projection tuple has to be
> made before re-inserting it. So there will be a list of Heap tuples stored
> in memory for each page getting updated. these tuples will be inserted after
> the deletion part of update is done. This is just a rough design. I may get
> involved in a detail design once i get a nod from the mailing list
> community.

I doubt the locking overhead is that significant. Have you done any
profiling to show that it's worth it?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: Improving the Performance of Full Table Updates

From
Tom Lane
Date:
"Gokulakannan Somsundaram" <gokul007@gmail.com> writes:
> I propose to change this row-by-row approach, when it is a full table
> update. I plan to send a extra flag(which will be set for Full table
> Deletes/Updates). this would make the access method directly acquire the
> exclusive lock and update the existing record.

This sounds like a recipe for utter destruction of what little
modularity and layering we've got left.  And I rather doubt it will buy
anything interesting performance-wise.

To cite just one concrete objection: surely the tuple-fetch code has got
no business firing triggers.
        regards, tom lane


Re: Improving the Performance of Full Table Updates

From
"Gokulakannan Somsundaram"
Date:
Hi Tom,
      Thanks for the feedback. Let me clarify my intention again.
This was thought for improving the performance of the Data Warehousing applications

Full table updates similar to
"Update dd set n2=n2+1"

When you talked about firing triggers, i looked into the implementation of triggers and the approach i suggested may not work fine with the Triggers. Since we cannot hold write locks and fire triggers, triggers should get disabled for this. But Remember in Data Warehousing applications, people won't be usually having Row-level triggers.

My alternate suggestion would be to make this kind of update an optional one to provide the speed up.

If a page contains 100 rows, then the current scenario takes 1 read lock + 100 write locks to complete the full table update. In the suggested scenario, it takes one Write Lock. Also it reduces the 101 Logical I/Os to1 Logical I/O. This might provide the same kind of benefit the Bitmap Index Scan provided.

Again there are some specific cases
a) If the tuple is locked / Concurrently being updated:
                    Currently we release the buffer, take a lock on the tuple and wait for the transaction to complete in case of concurrent updates. In the Full table update also, we will do the same.

b) If the page contains lot of deleted tuples:
                     This is a tricky scenario. Say if we have 100 tuples and we have 20% of them deleted. In the current scenario, we will find that out during the read lock and we will not waste time in those tuples during the write lock. But in the suggested scenario, we will be wasting time in those with the write lock on the buffer. In order to circumvent that, we can resort to a one read lock + one write lock combination.

Again if this full table updates are thought with the OLTP applications in mind, then this is not at all a suitable option. This will only benefit the people with Data Warehouses.

Expecting some more replies....


Thanks,
Gokul.




On 9/20/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Gokulakannan Somsundaram" <gokul007@gmail.com> writes:
> I propose to change this row-by-row approach, when it is a full table
> update. I plan to send a extra flag(which will be set for Full table
> Deletes/Updates). this would make the access method directly acquire the
> exclusive lock and update the existing record.

This sounds like a recipe for utter destruction of what little
modularity and layering we've got left.  And I rather doubt it will buy
anything interesting performance-wise.

To cite just one concrete objection: surely the tuple-fetch code has got
no business firing triggers.

                        regards, tom lane

Re: Improving the Performance of Full Table Updates

From
"Heikki Linnakangas"
Date:
Gokulakannan Somsundaram wrote:
> Again if this full table updates are thought with the OLTP applications in
> mind, then this is not at all a suitable option. This will only benefit the
> people with Data Warehouses.
> 
> Expecting some more replies....

Start with profiling.

I just ran a quick oprofile run of a full-table UPDATE on a simple table
with one index, and it looks like RelationGetBufferForTuple uses 4.53%
of the CPU time. Out of that, 2.86 percentage points are spent in
ReadBuffer_common. That means that write-locking the heap pages takes at
most 4.53 - 2.86 = 1.67 % of the total CPU time.

That's the upper limit of the benefit from the scheme you're proposing.
Surely the effort would be better spent on something else. For example,
if you kept the insertion target page just pinned over the calls, which
wouldn't have the problems with triggers etc, you could save that 2.86%.
Which still isn't much. Or take a look at WAL logging. XLogInsert took
16.06% of the CPU time. Earlier tests have suggested that a big chunk of
that time is spent in CRC calculation. Alternative CRC methods have been
suggested in the past, or perhaps that could time could be offloaded to
the WAL writer process, speeding up the UPDATE on a multi-CPU server.

Also, if we're talking about data warehousing, we're talking about big
tables that don't fit in memory. That means that you're likely
bottlenecked by I/O speed, not CPU. If that's the case, saving some CPU
time makes no difference whatsoever. What would help with I/O bottleneck
is to try to make the disk footprint smaller, or make better use of the
I/O bandwidth available.

Three steps to improve throughput:

1. Identify the hardware component that's the bottleneck.
2. Profile the workload to see what's using the bottlenecked resource
the most.
3. Figure out how to make that piece of code cheaper.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Improving the Performance of Full Table Updates

From
"Gokulakannan Somasundaram"
Date:
Hi Tom/ Heikki,
           Thanks for the suggestion. After profiling i got similar results. So i am thinking of a design like this to get the performance improvement.

a) We can get one page for insert(during update)  and we will hold the write lock on it, till the page gets filled. In this way, RelationGetBufferForTuple will get called only once for one page of inserts.

b) Do you think if we can optimize the XlogInsert in such a way, it will write a page instead of writing all the records in the page.  I think we need to write a recovery routine for the same. Currently the page gets flushed to the WAL, if it gets modified after the checkpoint. So i still need to understand those code pieces. But do you think it is wise to continue working on this line?

Thanks,
Gokul.

On 9/21/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
Gokulakannan Somsundaram wrote:
> Again if this full table updates are thought with the OLTP applications in
> mind, then this is not at all a suitable option. This will only benefit the
> people with Data Warehouses.
>
> Expecting some more replies....

Start with profiling.

I just ran a quick oprofile run of a full-table UPDATE on a simple table
with one index, and it looks like RelationGetBufferForTuple uses 4.53%
of the CPU time. Out of that, 2.86 percentage points are spent in
ReadBuffer_common. That means that write-locking the heap pages takes at
most 4.53 - 2.86 = 1.67 % of the total CPU time.

That's the upper limit of the benefit from the scheme you're proposing.
Surely the effort would be better spent on something else. For example,
if you kept the insertion target page just pinned over the calls, which
wouldn't have the problems with triggers etc, you could save that 2.86%.
Which still isn't much. Or take a look at WAL logging. XLogInsert took
16.06% of the CPU time. Earlier tests have suggested that a big chunk of
that time is spent in CRC calculation. Alternative CRC methods have been
suggested in the past, or perhaps that could time could be offloaded to
the WAL writer process, speeding up the UPDATE on a multi-CPU server.

Also, if we're talking about data warehousing, we're talking about big
tables that don't fit in memory. That means that you're likely
bottlenecked by I/O speed, not CPU. If that's the case, saving some CPU
time makes no difference whatsoever. What would help with I/O bottleneck
is to try to make the disk footprint smaller, or make better use of the
I/O bandwidth available.

Three steps to improve throughput:

1. Identify the hardware component that's the bottleneck.
2. Profile the workload to see what's using the bottlenecked resource
the most.
3. Figure out how to make that piece of code cheaper.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: Improving the Performance of Full Table Updates

From
Heikki Linnakangas
Date:
Gokulakannan Somasundaram wrote:
> Hi Tom/ Heikki,
>            Thanks for the suggestion. After profiling i got similar results.
> So i am thinking of a design like this to get the performance improvement.
> 
> a) We can get one page for insert(during update)  and we will hold the write
> lock on it, till the page gets filled. In this way,
> RelationGetBufferForTuple will get called only once for one page of inserts.

The locking actually isn't that expensive once you have the page pinned.
For starters, keep the page pinned over calls to heap_update, and just
relock it instead of calling RelationGetBufferForTuple. Unsurprisingly,
this is not a new idea:
http://archives.postgresql.org/pgsql-patches/2007-05/msg00499.php.

> b) Do you think if we can optimize the XlogInsert in such a way, it will
> write a page instead of writing all the records in the page.  I think we
> need to write a recovery routine for the same. Currently the page gets
> flushed to the WAL, if it gets modified after the checkpoint. So i still
> need to understand those code pieces. But do you think it is wise to
> continue working on this line?

It's going to be very difficult at least. There's a lot of race
conditions lurking if you try to coalesce multiple updates to a single
WAL record.

That said, making XLogInsert faster would help a lot of use cases, not
only full-table udpates. Most of the time is spent calculating the CRC,
but it has some other non-trivial overhead as well. Profiling XLogInsert
in more detail, and figuring out how to make it faster would be very nice.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Improving the Performance of Full Table Updates

From
"Gokulakannan Somasundaram"
Date:
Hi Heikki,
           Finally i found some time to look more into the CRC code. The time is mostly taken when there is a back-up block in the XLOG structure. when it calculates the CRC for the entire block(there is some optimization already for the holes), the time is spent on the CRC macro. I tried doing some small optimization like changing the int into uint8, thinking that the exponentiation operation might get slightly faster, with no success. I don't know whether changing the CRC algorithm is a good option.
           I have other observations on including the snapshot information into the indexes, on which i will make a proposal. Please provide me guidance on that.

Thanks,
Gokul.

On 9/27/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
Gokulakannan Somasundaram wrote:
> Hi Tom/ Heikki,
>            Thanks for the suggestion. After profiling i got similar results.
> So i am thinking of a design like this to get the performance improvement.
>
> a) We can get one page for insert(during update)  and we will hold the write
> lock on it, till the page gets filled. In this way,
> RelationGetBufferForTuple will get called only once for one page of inserts.

The locking actually isn't that expensive once you have the page pinned.
For starters, keep the page pinned over calls to heap_update, and just
relock it instead of calling RelationGetBufferForTuple. Unsurprisingly,
this is not a new idea:
http://archives.postgresql.org/pgsql-patches/2007-05/msg00499.php.

> b) Do you think if we can optimize the XlogInsert in such a way, it will
> write a page instead of writing all the records in the page.  I think we
> need to write a recovery routine for the same. Currently the page gets
> flushed to the WAL, if it gets modified after the checkpoint. So i still
> need to understand those code pieces. But do you think it is wise to
> continue working on this line?

It's going to be very difficult at least. There's a lot of race
conditions lurking if you try to coalesce multiple updates to a single
WAL record.

That said, making XLogInsert faster would help a lot of use cases, not
only full-table udpates. Most of the time is spent calculating the CRC,
but it has some other non-trivial overhead as well. Profiling XLogInsert
in more detail, and figuring out how to make it faster would be very nice.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: Improving the Performance of Full Table Updates

From
"Heikki Linnakangas"
Date:
Gokulakannan Somasundaram wrote:
>            Finally i found some time to look more into the CRC code. The
> time is mostly taken when there is a back-up block in the XLOG structure.
> when it calculates the CRC for the entire block(there is some optimization
> already for the holes), the time is spent on the CRC macro. I tried doing
> some small optimization like changing the int into uint8, thinking that the
> exponentiation operation might get slightly faster, with no success. I don't
> know whether changing the CRC algorithm is a good option.

Yeah, I believe the current implementation is quite well-optimized, so
it's unlikely that you'll find any micro-optimizations like that that
make any significant difference. I'm afraid you're going to need bigger
changes to make it faster. Some ideas that I've had in the past that
might be worth investigating:

* Use a much bigger WAL block size, to avoid splitting each big record,
like one with a backup block. One block / one WAL segment might not be a
bad idea.

* mmap the WAL segments, instead of using the slru buffers. Even if it
didn't help with performance per se, we could get rid of the wal_buffers
setting and let the OS manage the caching.

* Use a cheaper CRC algorithm.

* Calculate CRC when flushing, instead of inserting, the WAL. This would
allow the WAL writer to pick up much the of the CRC work, increasing
throughput on multi-core systems, particularly on bulk loads.

* Cover multiple WAL records with the CRC. Presumably calculating the
CRC of a large contiguous chunk cheaper than n smaller chunks.

* Allow buffering of multiple WAL segments, to avoid the WAL flush on
every xlog segment switch.

* Reorder WAL records on the fly, so that if you have something like
"full page image + update + update + update + update" all happening on a
single page, we could skip writing the individual update records, and
take a full page image of the page after the last update instead.
(getting all the locking right is quite tricky...)

* Change the locking so that you don't need to hold the WALInsertLock
while memcpying the record to the WAL block, but just to allocate the
slot for it. Wouldn't increase single-process performance, but would
make it more scalable.

I haven't pursued any of these further, so some or all of them might
turn out to be not helpful or not feasible.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Improving the Performance of Full Table Updates

From
Tom Lane
Date:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> * mmap the WAL segments, instead of using the slru buffers.

This one's almost certainly a nonstarter, for lack of any portable way
to control when writes happen.
        regards, tom lane


Re: Improving the Performance of Full Table Updates

From
"Heikki Linnakangas"
Date:
Tom Lane wrote:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>> * mmap the WAL segments, instead of using the slru buffers.
> 
> This one's almost certainly a nonstarter, for lack of any portable way
> to control when writes happen.

For flushing, there's msync, which I believe is quite portable.

We mustn't write data pages before the WAL hits the disk, but we don't
have any such limitation for WAL.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Improving the Performance of Full Table Updates

From
Tom Lane
Date:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> We mustn't write data pages before the WAL hits the disk, but we don't
> have any such limitation for WAL.

Um, you're right, I was remembering discussions about trying to mmap
data files.  Still, googling "msync performance" leaves me full of
concern about whether this would be a portable improvement.
        regards, tom lane


Re: Improving the Performance of Full Table Updates

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>> * mmap the WAL segments, instead of using the slru buffers.
>
> This one's almost certainly a nonstarter, for lack of any portable way
> to control when writes happen.

I think mlock and msync(MS_SYNC) ought to be enough. I'm not sure every OS
implements them as specified though.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com