Thread: questions about PG update performance

questions about PG update performance

From
Kisung Kim
Date:
Dear all,

I have a question about the update performance of PG.

Because of the internal implementation of MVCC in PG,
the update of a row is actually an insertion of a new row version.
So if the size of a row is huge, it incurs some overhead compared to an in-place update strategy.

Let's assume that a table has 200 columns,
and a user updates one of the columns of one row in the table.
Then PG will rewrite the whole contents of the updated row,
including both the updated column and the unchanged columns.

I'm not sure about the implementation of Oracle's update.
But if Oracle can overwrite only the updated column,
the performance difference between Oracle and PG in that case may be significant.

I searched the mailing lists and Google about this issue,
but I haven't found anything related to it.

Thank you.

Kisung Kim.
(C)Bitnine, Kisung Kim, Ph.D
https://sites.google.com/site/kisungresearch/
E-mail : kskim@bitnine.co.kr
Office phone : 070-4800-3321
Mobile phone : 010-7136-0834
Fax : 02-568-1332

Re: questions about PG update performance

From
Michael Paquier
Date:


On Mon, Oct 26, 2015 at 10:52 AM, Kisung Kim wrote:
> Because of the internal implementation of MVCC in PG,
> the update of a row is actually an insertion of a new row version.
> So if the size of a row is huge, it incurs some overhead compared to an
> in-place update strategy.

Yeah, that's how an UPDATE works in Postgres for MVCC purposes. The xmax of the old row version is set, and a new row version is inserted with an xmin equal to that xmax. So if you update tuple fields one by one, the cost is going to be high.
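
For illustration, this is easy to observe from SQL via the ctid, xmin and
xmax system columns; a minimal sketch (the table name is made up):

    -- Each UPDATE writes a whole new tuple: ctid and xmin change.
    CREATE TABLE t (id int PRIMARY KEY, val text);
    INSERT INTO t VALUES (1, 'old');
    SELECT ctid, xmin, xmax, * FROM t;  -- note the tuple's ctid and xmin

    UPDATE t SET val = 'new' WHERE id = 1;
    SELECT ctid, xmin, xmax, * FROM t;  -- new ctid and xmin: a new row version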
 
> Let's assume that a table has 200 columns,
> and a user updates one of the columns of one row in the table.
> Then PG will rewrite the whole contents of the updated row,
> including both the updated column and the unchanged columns.

When a table has a large number of columns, usually I would say that you have a normalization problem, and such schemas could be split into a smaller set of tables, minimizing the UPDATE cost.
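
For example, a minimal sketch of such a split (all names here are hypothetical): the frequently updated column moves into its own narrow table, so each UPDATE rewrites a small tuple instead of a 200-column one:

    -- Hypothetical vertical split of a wide table.
    CREATE TABLE account_static (
        id   int PRIMARY KEY,
        name text
        -- ... the other rarely-updated columns ...
    );
    CREATE TABLE account_counter (
        id   int PRIMARY KEY REFERENCES account_static (id),
        hits bigint NOT NULL DEFAULT 0
    );

    -- Only the narrow row is rewritten on each update:
    UPDATE account_counter SET hits = hits + 1 WHERE id = 42;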
 
> I'm not sure about the implementation of Oracle's update.
> But if Oracle can overwrite only the updated column,
> the performance difference between Oracle and PG in that case may be significant.

> I searched the mailing lists and Google about this issue,
> but I haven't found anything related to it.

What you are looking at here is columnar storage; Alvaro and the 2ndQuadrant folks have been doing some work in this area recently:
http://www.postgresql.org/message-id/20150831225328.GM2912@alvherre.pgsql
Also, you may want to have a look at cstore_fdw: https://github.com/citusdata/cstore_fdw.
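
For reference, basic cstore_fdw usage looks roughly like this (a sketch based on its README; the table definition is made up, and note it is aimed at analytic scans rather than UPDATE-heavy workloads):

    -- Rough sketch of a columnar foreign table with cstore_fdw.
    CREATE EXTENSION cstore_fdw;
    CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;
    CREATE FOREIGN TABLE measurements (
        device_id int,
        reading   float8,
        taken_at  timestamptz
    ) SERVER cstore_server OPTIONS (compression 'pglz');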
Regards,
--
Michael

Re: questions about PG update performance

From
Kisung Kim
Date:

2015-10-26 11:12 GMT+09:00 Michael Paquier <michael.paquier@gmail.com>:

> What you are looking at here is columnar storage; Alvaro and the
> 2ndQuadrant folks have been doing some work in this area recently:
> http://www.postgresql.org/message-id/20150831225328.GM2912@alvherre.pgsql
> Also, you may want to have a look at cstore_fdw: https://github.com/citusdata/cstore_fdw.

Thank you for your reply.
I already know about column stores and that they are optimized for the case I described.

However, what I want to know about is the update performance difference between PG and Oracle, if there is any.
The case I described was chosen to exaggerate the difference between PG and Oracle.

I want to explain to our clients that PG's update performance is comparable to Oracle's.

Regards, 



Re: questions about PG update performance

From
Любен Каравелов
Date:
----- Quote from Kisung Kim (kskim@bitnine.co.kr), on 26.10.2015 at 04:36 -----

> However, what I want to know about is the update performance difference
> between PG and Oracle, if there is any.
> The case I described was chosen to exaggerate the difference between PG
> and Oracle.
>
> I want to explain to our clients that PG's update performance is
> comparable to Oracle's.
>

Oracle also uses MVCC, but it copies the old row into the rollback segment and
rewrites the values in place. It is still two writes, as in Postgres. The
difference is in rollback and in cleaning up the row when it is not needed
anymore.

Regards,

--
Luben Karavelov




Re: questions about PG update performance

From
Amit Kapila
Date:
On Mon, Oct 26, 2015 at 9:03 AM, Любен Каравелов <karavelov@mail.bg> wrote:
>
> Oracle also uses MVCC, but it copies the old row into the rollback
> segment and rewrites the values in place.

I think Oracle just copies the changed part of the old row to the rollback
segment.  Also, in Redo logs, it just writes the changed column value (both
old and new).  So for the case we are discussing in this thread (one changed
column out of 200), Oracle will just write the old value of that column in
Redo and then in the rollback segment, and write the new value in Redo and
then do the in-place update of the heap row.

In PostgreSQL, the whole new row is written to the heap, and a diff tuple (the
difference between the old and new tuples; this optimization was done in 9.4,
commit a3115f0d) is written to WAL.  I think we can try to optimize and use the
same technique for the heap as is used for WAL to make PostgreSQL more efficient
for such scenarios; however, as of today, my guess is that PostgreSQL's update
would be lagging in this area.
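
For what it's worth, one rough way to observe the WAL volume of a single
UPDATE from SQL (the table and column names are made up; the function names
below are from recent releases, while older ones used
pg_current_xlog_location() and pg_xlog_location_diff()):

    -- Sketch: note the WAL position before and after one UPDATE, then
    -- compute the bytes written between the two points.
    SELECT pg_current_wal_lsn();                       -- e.g. '0/16B6A38'
    UPDATE wide_table SET col42 = col42 + 1 WHERE id = 1;
    SELECT pg_current_wal_lsn();                       -- e.g. '0/16B6B80'
    SELECT pg_wal_lsn_diff('0/16B6B80', '0/16B6A38');  -- WAL bytes (placeholder LSNs)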

> It is still two writes, as in Postgres.

The difference is in the amount of data written per write.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: questions about PG update performance

From
Ashutosh Bapat
Date:


On Mon, Oct 26, 2015 at 10:19 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think Oracle just copies the changed part of the old row to the rollback
> segment.  Also, in Redo logs, it just writes the changed column value (both
> old and new). [...] Oracle will just write the old value of that column in
> Redo and then in the rollback segment, and write the new value in Redo and
> then do the in-place update of the heap row.


In that case, readers would pay the penalty of reconstructing the row. PostgreSQL does not incur that reconstruction cost. Either the writer or the reader is bound to pay a penalty. If the user's load is read-heavy, it makes sense to use something like PG; otherwise, something like what is described above.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: questions about PG update performance

From
Amit Kapila
Date:
On Mon, Oct 26, 2015 at 12:07 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
> In that case, readers would pay the penalty of reconstructing the row.

Readers whose snapshot is older than the updating transaction need to
pay that cost; all newer transactions can read directly from the
page.  Also, not all old-snapshot readers have to pay any such cost.

Not only that, such a design has the advantage that the bloat due to
older row versions won't be there.
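
As an aside, the bloat on the PostgreSQL side can be observed with the
pgstattuple contrib module; a sketch (the table name is made up):

    -- Dead row versions left behind by updates, awaiting VACUUM.
    CREATE EXTENSION IF NOT EXISTS pgstattuple;
    SELECT dead_tuple_count, dead_tuple_percent
    FROM pgstattuple('wide_table');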


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: questions about PG update performance

From
Ashutosh Bapat
Date:


On Mon, Oct 26, 2015 at 4:09 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Readers whose snapshot is older than the updating transaction need to
> pay that cost; all newer transactions can read directly from the page.
> Also, not all old-snapshot readers have to pay any such cost.

Can you please explain your last sentence?

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: questions about PG update performance

From
José Luis Tallón
Date:
On 10/26/2015 05:49 AM, Amit Kapila wrote:
> I think Oracle just copies the changed part of the old row to the rollback
> segment.  Also, in Redo logs, it just writes the changed column value (both
> old and new). [...]

IMV, where Oracle is heavily optimized for "most DML transactions will commit successfully" and "no long-running transactions shall ever exist" / "not many transactions will have to read previous snapshots" (based on PIs), Postgres does not actually make any such assumptions.

    Hence, for long-running transactions / massive concurrency with many clients reading and writing older snapshots, Postgres will be faster (less work to do compared to reconstructing rows based on PIs).

    Plus, for updates where the size of the NEW row is bigger than the previous one (think adding text), the overhead is actually greater for Oracle (plus, they don't compress variable-length values by default / no TOAST)... so here Postgres would be faster.
    For text-intensive workloads, Postgres is measurably faster than Oracle, mostly due to this fact (plus it is much more efficient in its use of storage/RAM...).
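
    As a quick sketch of the compression point (the table is made up), Postgres transparently compresses large varlena values via TOAST:

        CREATE TABLE doc (id int, body text);
        INSERT INTO doc VALUES (1, repeat('some repetitive text ', 50000));
        SELECT octet_length(body)   AS raw_bytes,      -- ~1 MB of text
               pg_column_size(body) AS stored_bytes    -- far smaller after compression
        FROM doc;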


> In PostgreSQL, the whole new row is written to the heap, and a diff tuple
> (the difference between the old and new tuples; this optimization was done
> in 9.4, commit a3115f0d) is written to WAL. [...] my guess is that
> PostgreSQL's update would be lagging in this area.

Yup. But see above for a potential reason why it might not be that bad, especially after the optimization you mention.

>> It is still two writes, as in Postgres.
>
> The difference is in the amount of data written per write.

Yes, but it is compressed (for varlena-based datums), batched (group commit) so mostly sequential, and non-duplicated (WAL vs REDO+UNDO).

So I guess the difference is quite small nowadays, and it will be heavily influenced by the actual workload.


Just my 2 (euro-) cents.


    / J.L.

Re: questions about PG update performance

From
Jim Nasby
Date:
On 10/25/15 9:36 PM, Kisung Kim wrote:
> I want to explain for our clients that PG's update performance is
> comparable to Oracle's.

There's really only 2 ways you can answer that. You can either handwave 
the question away ("Yes, update performance is comparable."), or you 
have to do actual benchmarking. Trying to answer this from a theoretical 
standpoint is completely useless because there's an absurd number of 
things that will affect this:

Number of columns
Data types
Size of overall transaction
Percent of transactions that roll back
Size of table
What % of table is updated every day
Underlying hardware
What OS the database is running on
What filesystem the database is running on

... and that's just off the top of my head.

Or to look at it another way, I guarantee you can create a scenario 
where Postgres beats the pants off Oracle, *or vice versa*. So you have 
to either go with an answer along the lines of "For most workloads the 
performance of both databases is similar." or you have to benchmark the 
actual application in question. Most performance issues you find will 
probably be correctable with a moderate amount of work.
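
For instance, a hedged sketch of such a benchmark with pgbench (the table 
and column names are made up; the random() script function is available in 
recent pgbench versions):

    -- Save as update_one_col.sql, then run something like:
    --   pgbench -n -f update_one_col.sql -c 8 -T 60 mydb
    \set id random(1, 100000)
    UPDATE wide_table SET col42 = col42 + 1 WHERE id = :id;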

To me, the real tradeoff between Postgres and Oracle (or any other 
commercial database) is whether you'd rather spend money on expert 
employees or software contracts.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: questions about PG update performance

From
Gavin Flower
Date:
On 27/10/15 11:37, Jim Nasby wrote:
> On 10/25/15 9:36 PM, Kisung Kim wrote:
>> I want to explain to our clients that PG's update performance is
>> comparable to Oracle's.
>
> There's really only 2 ways you can answer that. [...] Trying to answer 
> this from a theoretical standpoint is completely useless because 
> there's an absurd number of things that will affect this:
>
> [...]
>
> ... and that's just off the top of my head.

And of course, on how you alter the tuning parameters in 
postgresql.conf, like temp_buffers and work_mem.  The 'correct' values 
will depend on your workload, the amount of RAM, etc.
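
For instance (a sketch; the values are purely illustrative, not 
recommendations):

    -- Adjust settings server-wide; the right values depend on workload
    -- and available RAM.  ALTER SYSTEM is available in 9.4+.
    ALTER SYSTEM SET work_mem = '64MB';
    ALTER SYSTEM SET temp_buffers = '32MB';
    SELECT pg_reload_conf();  -- new sessions pick up the changes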





Re: questions about PG update performance

From
Amit Kapila
Date:
On Mon, Oct 26, 2015 at 4:31 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
>> Readers whose snapshot is older than the updating transaction need to
>> pay that cost; all newer transactions can read directly from the page.
>> Also, not all old-snapshot readers have to pay any such cost.
>
> Can you please explain your last sentence?


At a broad level, it works this way: when the transaction starts, it
makes a note of the current SCN, and then while reading a table or
an index page, it uses that SCN to determine whether the page contains
the effects of transactions that should not be visible to the current
transaction. If the page is found to contain the effects of invisible
transactions, it recreates an older version of the page by undoing
the effects of each such transaction. Once this older version
of the page is recreated, it can be used to fetch rows for transactions
older than the transaction that updated the page and
newer than the transaction that recreated the page.  For details
you can read the blog [1] written some time back.