Re: PostgreSQL clustering VS MySQL clustering - Mailing list pgsql-performance

From PFC
Subject Re: PostgreSQL clustering VS MySQL clustering
Date
Msg-id opsk76kbw7th1vuj@musicbox
In response to Re: PostgreSQL clustering VS MySQL clustering  ("Jim C. Nasby" <decibel@decibel.org>)
Responses Re: PostgreSQL clustering VS MySQL clustering
List pgsql-performance
http://borg.postgresql.org/docs/8.0/interactive/storage-page-layout.html


> If you vacuum as part of the transaction it's going to be more efficient
> of resources, because you have more of what you need right there (ie:
> odds are that you're on the same page as the old tuple). In cases like
> that it very likely makes a lot of sense to take a small hit in your
> transaction time up-front, instead of a larger hit doing a vacuum down
> the road.

    One advantage would be that you're going to write the page to disk anyway,
since it has been modified, so why not vacuum it while you're there? If the
machine is CPU-bound you lose; if it's IO-bound you save some IO, but the
cost of index updates has to be taken into account...

    This prompted a few questions:

Note: temp contains 128k (131072) values generated from a sequence.
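(The post does not show how temp was built; a hypothetical way to get 131072
sequential values on 8.0 would be:

-- hypothetical setup, not from the original message
create table temp as select * from generate_series(1, 131072) as id;
)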

create table test (id serial primary key, a integer, z integer, e integer,
r integer, t integer, y integer ) without oids;
insert into test (id,a,z,e,r,t,y) select id,0,0,0,0,0,0 from temp;
  INSERT 0 131072


explain analyze update test set y=1;
  Seq Scan on test  (cost=0.00..2226.84 rows=126284 width=30) (actual
time=0.047..617.553 rows=131072 loops=1)
  Total runtime: 4432.509 ms

explain analyze update test set y=1;
  Seq Scan on test  (cost=0.00..4453.68 rows=252568 width=30) (actual
time=52.198..611.594 rows=131072 loops=1)
  Total runtime: 5739.064 ms

explain analyze update test set y=1;
  Seq Scan on test  (cost=0.00..6680.52 rows=378852 width=30) (actual
time=127.301..848.762 rows=131072 loops=1)
  Total runtime: 6548.206 ms

It gets slower as more and more dead tuples accumulate, which is expected
since this is a seq scan. Note how the row estimates keep growing along with
the physical size of the table...
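To see the bloat behind those estimates, the table's page count can be
checked between runs (this check is an addition, it was not part of the
original test; relpages and reltuples are only refreshed by VACUUM or ANALYZE):

-- rough bloat check, not part of the original test
select relname, relpages, reltuples from pg_class where relname = 'test';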


vacuum full test;
explain analyze update test set y=1;
  Seq Scan on test  (cost=0.00..2274.72 rows=131072 width=30) (actual
time=0.019..779.864 rows=131072 loops=1)
  Total runtime: 5600.311 ms

vacuum full test;
explain analyze update test set y=1;
  Seq Scan on test  (cost=0.00..2274.72 rows=131072 width=30) (actual
time=0.039..1021.847 rows=131072 loops=1)
  Total runtime: 5126.590 ms

-> Seems vacuum full does its job....
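As a hypothetical aside (not run in the original test), the VERBOSE form
reports how many tuples were removed and how many pages remain:

-- not run in the original test
vacuum full verbose test;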

vacuum test;
explain analyze update test set y=1;
  Seq Scan on test  (cost=0.00..3894.08 rows=196608 width=30) (actual
time=36.491..860.135 rows=131072 loops=1)
  Total runtime: 7293.698 ms

vacuum test;
explain analyze update test set y=1;
  Seq Scan on test  (cost=0.00..3894.08 rows=196608 width=30) (actual
time=0.044..657.125 rows=131072 loops=1)
  Total runtime: 5934.141 ms

vacuum analyze test;
explain analyze update test set y=1;
  Seq Scan on test  (cost=0.00..3894.08 rows=196608 width=30) (actual
time=0.018..871.132 rows=131072 loops=1)
  Total runtime: 5548.053 ms

-> Here plain vacuum is about as slow as vacuum full (which is normal, since
the whole table is updated each time); however, the row estimate is still off
even after ANALYZE.


  Let's create a few indices:

vacuum full test;
create index testa on test(a);
create index testz on test(z);
create index teste on test(e);
create index testr on test(r);
create index testt on test(t);
-- we don't create an index on y
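To watch the indexes themselves grow across the runs below, a catalog query
like this could be used (a sketch, not part of the original test; relpages is
only refreshed by VACUUM or ANALYZE):

-- sketch: page counts of all indexes on test
select c.relname, c.relpages
from pg_class c
join pg_index i on i.indexrelid = c.oid
where i.indrelid = 'test'::regclass
order by c.relname;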


vacuum full test;
explain analyze update test set a=id;
  Seq Scan on test  (cost=0.00..2274.72 rows=131072 width=30) (actual
time=0.044..846.102 rows=131072 loops=1)
  Total runtime: 14998.307 ms

We see that the time spent updating the indexes has made this query a lot
slower. This is normal, but:

vacuum full test;
explain analyze update test set a=id;
  Seq Scan on test  (cost=0.00..2274.72 rows=131072 width=30) (actual
time=0.045..1387.626 rows=131072 loops=1)
  Total runtime: 17644.368 ms

Now, we updated ALL rows but didn't actually change a single value. However,
it took about the same time as the first one. I guess the updates really took
place, even though all they did was copy the rows with new transaction IDs.
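As an aside (not what is being measured here), an application can skip such
no-op updates itself with a filter:

-- hypothetical workaround: rows whose value would not change are not touched,
-- so they produce no new row versions and no index entries
update test set a = id where a is distinct from id;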
Now, let's update a column which is not indexed:

vacuum full test;
explain analyze update test set y=id;
  Seq Scan on test  (cost=0.00..2274.72 rows=131072 width=30) (actual
time=0.046..964.590 rows=131072 loops=1)
  Total runtime: 15628.143 ms

Takes about the same time: the indexes still have to be updated to point to
the new row versions, after all.

So, here is something annoying with the current approach: updating rows in a
table bloats ALL indices, not just those whose indexed values have actually
changed. So if you have a table with many indexed fields and you often update
some obscure timestamp field, all the indices will bloat. That will of course
be corrected by VACUUM, but vacuum then has extra work to do.
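(Plain VACUUM makes the dead index entries reusable rather than shrinking the
index files; an index that has already bloated badly can be rebuilt, at the
cost of an exclusive lock on the table while it runs. This is an aside, not
something tested above:

-- sketch: rebuild all indexes on test from scratch
reindex table test;
)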

    I don't have suggestions, just questions:

    Is there a way to avoid updating the indices when the indexed values do
not change?
    Would that depend on whether the updated tuple can be stored on the same
page it was on before (along with the old version)?
    If the answer is yes:
        - would saving the cost of updating the indexes pay for vacuuming the
page on the fly, to try to squeeze the new tuple version in?
        - would it be interesting to specify, for each table, a target % of
free space ('air holes') in pages for vacuum to try to achieve, so that
updated row versions can be inserted on the same page they were on before,
saving the index updates?

    Regards...
