Re: PostgreSQL clustering VS MySQL clustering - Mailing list pgsql-performance
From: PFC
Subject: Re: PostgreSQL clustering VS MySQL clustering
Date:
Msg-id: opsk76kbw7th1vuj@musicbox
In response to: Re: PostgreSQL clustering VS MySQL clustering ("Jim C. Nasby" <decibel@decibel.org>)
Responses: Re: PostgreSQL clustering VS MySQL clustering
List: pgsql-performance
http://borg.postgresql.org/docs/8.0/interactive/storage-page-layout.html

> If you vacuum as part of the transaction it's going to be more efficient
> of resources, because you have more of what you need right there (ie:
> odds are that you're on the same page as the old tuple). In cases like
> that it very likely makes a lot of sense to take a small hit in your
> transaction time up-front, instead of a larger hit doing a vacuum down
> the road.

Some pros would be that you're going to make a disk write anyway because the page is modified, so why not vacuum that page while it's there. If the machine is CPU bound you lose; if it's IO bound you save some IO, but the cost of index updates has to be taken into account...

It prompted a few questions.

Note: temp contains 128k (131072) values generated from a sequence.

create table test (id serial primary key, a integer, z integer, e integer, r integer, t integer, y integer) without oids;
insert into test (id,a,z,e,r,t,y) select id,0,0,0,0,0,0 from temp;
INSERT 0 131072

explain analyze update test set y=1;
 Seq Scan on test  (cost=0.00..2226.84 rows=126284 width=30) (actual time=0.047..617.553 rows=131072 loops=1)
 Total runtime: 4432.509 ms

explain analyze update test set y=1;
 Seq Scan on test  (cost=0.00..4453.68 rows=252568 width=30) (actual time=52.198..611.594 rows=131072 loops=1)
 Total runtime: 5739.064 ms

explain analyze update test set y=1;
 Seq Scan on test  (cost=0.00..6680.52 rows=378852 width=30) (actual time=127.301..848.762 rows=131072 loops=1)
 Total runtime: 6548.206 ms

Gets slower as more and more dead tuples accumulate... normal, as this is a seq scan. Note the row estimates getting bigger with the table size...
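A rough sketch of how to watch the bloat directly rather than inferring it from the plan; these queries were not part of the run above, and keep in mind that relpages/reltuples in pg_class are only refreshed by VACUUM or ANALYZE:

-- table size in 8 kB pages and the tuple count as of the last VACUUM/ANALYZE
select relname, relpages, reltuples from pg_class where relname = 'test';

-- reports removable (dead) and nonremovable row versions; running it will
-- of course also reclaim the dead tuples, so only do it between runs
vacuum verbose test;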
vacuum full test;
explain analyze update test set y=1;
 Seq Scan on test  (cost=0.00..2274.72 rows=131072 width=30) (actual time=0.019..779.864 rows=131072 loops=1)
 Total runtime: 5600.311 ms

vacuum full test;
explain analyze update test set y=1;
 Seq Scan on test  (cost=0.00..2274.72 rows=131072 width=30) (actual time=0.039..1021.847 rows=131072 loops=1)
 Total runtime: 5126.590 ms

-> Seems vacuum full does its job...

vacuum test;
explain analyze update test set y=1;
 Seq Scan on test  (cost=0.00..3894.08 rows=196608 width=30) (actual time=36.491..860.135 rows=131072 loops=1)
 Total runtime: 7293.698 ms

vacuum test;
explain analyze update test set y=1;
 Seq Scan on test  (cost=0.00..3894.08 rows=196608 width=30) (actual time=0.044..657.125 rows=131072 loops=1)
 Total runtime: 5934.141 ms

vacuum analyze test;
explain analyze update test set y=1;
 Seq Scan on test  (cost=0.00..3894.08 rows=196608 width=30) (actual time=0.018..871.132 rows=131072 loops=1)
 Total runtime: 5548.053 ms

-> Here plain vacuum is about as slow as vacuum full (which is normal, as the whole table is updated); however, the row estimate is still off even after ANALYZE.

Let's create a few indices:

vacuum full test;
create index testa on test(a);
create index testz on test(z);
create index teste on test(e);
create index testr on test(r);
create index testt on test(t);
-- we don't create an index on y

vacuum full test;
explain analyze update test set a=id;
 Seq Scan on test  (cost=0.00..2274.72 rows=131072 width=30) (actual time=0.044..846.102 rows=131072 loops=1)
 Total runtime: 14998.307 ms

We see that the index updating time has made this query a lot slower. This is normal, but:

vacuum full test;
explain analyze update test set a=id;
 Seq Scan on test  (cost=0.00..2274.72 rows=131072 width=30) (actual time=0.045..1387.626 rows=131072 loops=1)
 Total runtime: 17644.368 ms

Now we updated ALL rows but didn't actually change a single value; yet it took about the same time as the first one. I guess the updates all really took place, even if all they did was copy the rows with new transaction IDs.

Now, let's update a column which is not indexed:

vacuum full test;
explain analyze update test set y=id;
 Seq Scan on test  (cost=0.00..2274.72 rows=131072 width=30) (actual time=0.046..964.590 rows=131072 loops=1)
 Total runtime: 15628.143 ms

Takes about the same time: the indexes still have to be updated to reference the new rows after all.

So here is something annoying with the current approach: updating rows in a table bloats ALL indices, not just those whose indexed values have actually been updated. So if you have a table with many indexed fields and you often update some obscure timestamp field, all the indices will bloat. This will of course be corrected by VACUUM, but vacuum will have extra work to do.

I don't have suggestions, just questions:

Is there a way that an update to the indices can be avoided if the indexed values do not change?
Would it depend on whether an updated tuple can be stored on the same page it was on before, along with the old version? (See the sketch below.)

If the answer is yes:
- would saving the cost of updating the indexes pay off against vacuuming the page on the run to try to squeeze the new tuple version in?
- would it be interesting to specify for each table a target % of free space ('air holes') in pages for vacuum to try to achieve, in order to be able to insert updated row versions on the same page they were on before, and save index updates?
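One way to test where new row versions actually land is the ctid system column, which shows (block, offset) for the current version of a row; a rough sketch, not something from the run above:

-- where does row id=1 live right now?
select ctid, id, y from test where id = 1;

update test set y = y + 1 where id = 1;

-- if the block number in ctid changed, the new version did not fit on the
-- old page; on a freshly loaded table whose pages are packed full, it
-- usually will not
select ctid, id, y from test where id = 1;

That would at least let you measure how often an updated version could have stayed on its page if some free space had been kept there.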
Regards...