Thread: Very slow update + not using clustered index
I have these two tables:

             Table "de.summary"
    Column    |            Type             |   Modifiers
--------------+-----------------------------+---------------
 isbn         | character varying(10)       | not null
 source       | character varying(20)       | not null
 condition    | smallint                    |
 availability | smallint                    |
 price_list   | numeric(11,2)               |
 price_min    | numeric(11,2)               |
 last_update  | timestamp without time zone | default now()
Indexes:
    "summary_pkey" primary key, btree (isbn, source)

         Table "de.inventory"
    Column    |         Type          | Modifiers
--------------+-----------------------+-----------
 isbn         | character varying(10) |
 condition    | integer               |
 availability | integer               |
 price        | numeric(9,2)          |
Indexes:
    "inventory_isbn_idx" btree (isbn)

Both tables are clustered on their respective indexes. The entire
database has been freshly VACUUM FULL'd and ANALYZE'd (after
clustering).

I want to run the following query, but it takes a *very* long time.
Like this:

bookshelf=> explain analyze update summary set price_min=0,
availability=2, condition=9 where isbn = inventory.isbn and price_min =
inventory.price;
                                  QUERY PLAN
-----------------------------------------------------------------------
 Merge Join  (cost=496170.66..517271.50 rows=5051 width=51)
             (actual time=226940.723..292247.643 rows=419277 loops=1)
   Merge Cond: (("outer".price_min = "inner".price) AND
               ("outer"."?column7?" = "inner"."?column3?"))
   ->  Sort  (cost=366877.05..371990.05 rows=2045201 width=61)
             (actual time=162681.929..177216.158 rows=2045200 loops=1)
         Sort Key: summary.price_min, (summary.isbn)::text
         ->  Seq Scan on summary  (cost=0.00..44651.01 rows=2045201 width=61)
                                  (actual time=8.139..22179.379 rows=2045201 loops=1)
   ->  Sort  (cost=129293.61..131499.09 rows=882192 width=25)
             (actual time=64213.663..67563.175 rows=882192 loops=1)
         Sort Key: inventory.price, (inventory.isbn)::text
         ->  Seq Scan on inventory  (cost=0.00..16173.92 rows=882192 width=25)
                                    (actual time=5.773..21548.942 rows=882192 loops=1)
 Total runtime: 3162319.477 ms
(9 rows)

Running what I believe to be the comparable select query is more
reasonable:

bookshelf=> explain analyze select s.* from summary s, inventory i where
s.isbn = i.isbn and s.price_min = i.price;
                                  QUERY PLAN
-----------------------------------------------------------------------
 Merge Join  (cost=495960.66..517061.50 rows=5051 width=59)
             (actual time=194048.974..215835.982 rows=419277 loops=1)
   Merge Cond: (("outer".price_min = "inner".price) AND
               ("outer"."?column8?" = "inner"."?column3?"))
   ->  Sort  (cost=366667.05..371780.05 rows=2045201 width=59)
             (actual time=147678.109..149945.170 rows=2045200 loops=1)
         Sort Key: s.price_min, (s.isbn)::text
         ->  Seq Scan on summary s  (cost=0.00..49431.01 rows=2045201 width=59)
                                    (actual time=0.056..9304.803 rows=2045201 loops=1)
   ->  Sort  (cost=129293.61..131499.09 rows=882192 width=25)
             (actual time=46338.696..47183.739 rows=882192 loops=1)
         Sort Key: i.price, (i.isbn)::text
         ->  Seq Scan on inventory i  (cost=0.00..16173.92 rows=882192 width=25)
                                      (actual time=0.089..2419.187 rows=882192 loops=1)
 Total runtime: 216324.171 ms

I had figured that the tables would get sorted on isbn, because of the
clustering.
I understand why price might get chosen (fewer matches), but the planner
seems to be making the wrong choice:

bookshelf=> explain analyze select s.* from summary s, inventory i where
s.isbn = i.isbn;
                                  QUERY PLAN
-----------------------------------------------------------------------
 Merge Join  (cost=489500.66..512953.69 rows=882192 width=59)
             (actual time=152247.741..174408.812 rows=882192 loops=1)
   Merge Cond: ("outer"."?column8?" = "inner"."?column2?")
   ->  Sort  (cost=366667.05..371780.05 rows=2045201 width=59)
             (actual time=118562.097..120817.894 rows=2045146 loops=1)
         Sort Key: (s.isbn)::text
         ->  Seq Scan on summary s  (cost=0.00..49431.01 rows=2045201 width=59)
                                    (actual time=0.062..8766.683 rows=2045201 loops=1)
   ->  Sort  (cost=122833.61..125039.09 rows=882192 width=14)
             (actual time=33685.455..34480.190 rows=882192 loops=1)
         Sort Key: (i.isbn)::text
         ->  Seq Scan on inventory i  (cost=0.00..16173.92 rows=882192 width=14)
                                      (actual time=0.088..1942.173 rows=882192 loops=1)
 Total runtime: 174926.115 ms

So, my first question is: why is the planner still sorting on price when
isbn seems (considerably) quicker, and how can I force it to sort by
isbn (if I even should)?

The second question is: why, oh why does the update take such an
obscenely long time to complete? The 175s (and even 216s) for the select
seems reasonable given the size of the tables, but not 3000s to update
the same rows. The processor (AMD 1.3GHz) sits at 90%+ utilization for
most of the execution time.

I can post more information if it would be helpful, but this post is
long enough already.

TIA, and happy new year.
-mike

--
Mike Glover
Key ID BFD19F2C <mpg4@duluoz.net>
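[For readers following along: the update above relies on PostgreSQL's
implicit-FROM behavior, where `inventory` appears in the WHERE clause
without being listed. The same statement can be written with an explicit
FROM clause, which makes the join visible and is the documented form of
UPDATE...FROM. A sketch against the tables above, not separately timed:]

```sql
-- Equivalent update with an explicit FROM clause; the WHERE condition
-- still compares against the pre-update value of price_min.
UPDATE summary
   SET price_min = 0, availability = 2, condition = 9
  FROM inventory
 WHERE summary.isbn = inventory.isbn
   AND summary.price_min = inventory.price;
```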
Mike Glover <mpg4@duluoz.net> writes:
> I want to run the following query, but it takes a *very* long time.
> Like this:
> bookshelf=> explain analyze update summary set price_min=0,
> availability=2, condition=9 where isbn = inventory.isbn and price_min =
> inventory.price;
> ...
> Total runtime: 3162319.477 ms
> (9 rows)

> Running what I believe to be the comparable select query is more
> reasonable:
> bookshelf=> explain analyze select s.* from summary s, inventory i where
> s.isbn = i.isbn and s.price_min = i.price;
> ...
> Total runtime: 216324.171 ms

AFAICS these plans are identical, and therefore the difference in
runtime must be ascribed to the time spent actually doing the updates.
It seems unlikely that the raw row inserts and updating the single
index could be quite that slow --- perhaps you have a foreign key or
trigger performance problem?

> So, my first question is: why is the planner still sorting on price
> when isbn seems (considerably) quicker, and how can I force it to sort
> by isbn (if I even should)?

Is this PG 7.4?  It looks to me like the planner *should* consider both
possible orderings of the mergejoin sort keys.  I'm not sure that it
knows enough to realize that the key with more distinct values should be
put first, however.
A quick experiment shows that if the planner does not have any reason to
prefer one ordering over another, the current coding will put the last
WHERE clause first:

regression=# create table t1(f1 int, f2 int);
CREATE TABLE
regression=# create table t2(f1 int, f2 int);
CREATE TABLE
regression=# explain select * from t1,t2 where t1.f1=t2.f1 and t1.f2=t2.f2;
                               QUERY PLAN
-------------------------------------------------------------------------
 Merge Join  (cost=139.66..154.91 rows=25 width=16)
   Merge Cond: (("outer".f2 = "inner".f2) AND ("outer".f1 = "inner".f1))
   ->  Sort  (cost=69.83..72.33 rows=1000 width=8)
         Sort Key: t1.f2, t1.f1
         ->  Seq Scan on t1  (cost=0.00..20.00 rows=1000 width=8)
   ->  Sort  (cost=69.83..72.33 rows=1000 width=8)
         Sort Key: t2.f2, t2.f1
         ->  Seq Scan on t2  (cost=0.00..20.00 rows=1000 width=8)
(8 rows)

regression=# explain select * from t1,t2 where t1.f2=t2.f2 and t1.f1=t2.f1;
                               QUERY PLAN
-------------------------------------------------------------------------
 Merge Join  (cost=139.66..154.91 rows=25 width=16)
   Merge Cond: (("outer".f1 = "inner".f1) AND ("outer".f2 = "inner".f2))
   ->  Sort  (cost=69.83..72.33 rows=1000 width=8)
         Sort Key: t1.f1, t1.f2
         ->  Seq Scan on t1  (cost=0.00..20.00 rows=1000 width=8)
   ->  Sort  (cost=69.83..72.33 rows=1000 width=8)
         Sort Key: t2.f1, t2.f2
         ->  Seq Scan on t2  (cost=0.00..20.00 rows=1000 width=8)
(8 rows)

and so you could probably improve matters just by switching the order of
your WHERE clauses.  Of course this answer will break as soon as anyone
touches any part of the related code, so I'd like to try to fix it so
that there is actually a principled choice made.  Could you send along
the pg_stats rows for these columns?

> The second question is: why, oh why does the update take such an
> obscenely long time to complete?

See above --- the problem is not within the plan, but must be sought
elsewhere.

			regards, tom lane
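[The foreign-key/trigger hypothesis above can be checked directly in the
system catalogs. A sketch (the `regclass` casts and catalog columns exist
in PG 7.3+; empty results would rule out that explanation):]

```sql
-- Any triggers on the two tables, including FK-enforcement triggers:
SELECT tgname
  FROM pg_trigger
 WHERE tgrelid IN ('de.summary'::regclass, 'de.inventory'::regclass);

-- Any foreign-key constraints defined on, or referencing, summary:
SELECT conname, contype
  FROM pg_constraint
 WHERE conrelid  = 'de.summary'::regclass
    OR confrelid = 'de.summary'::regclass;
```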
Tom-
 Thanks for the quick response.  More details are inline.
-mike

On Thu, 01 Jan 2004 23:06:11 -0500
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Mike Glover <mpg4@duluoz.net> writes:
> AFAICS these plans are identical, and therefore the difference in
> runtime must be ascribed to the time spent actually doing the updates.
> It seems unlikely that the raw row inserts and updating the single
> index could be quite that slow --- perhaps you have a foreign key
> or trigger performance problem?

There are no foreign keys or triggers for either of the tables.

> Is this PG 7.4?

Yes, PG 7.4.

> A quick experiment shows that if the planner does not have any reason
> to prefer one ordering over another, the current coding will put the
> last WHERE clause first:
[snip]
> and so you could probably improve matters just by switching the order
> of your WHERE clauses.  Of course this answer will break as soon as
> anyone touches any part of the related code, so I'd like to try to fix
> it so that there is actually a principled choice made.  Could you send
> along the pg_stats rows for these columns?

It looks like the planner is already making a principled choice:

bookshelf=> explain select s.* from summary s, inventory i where
s.isbn = i.isbn and s.price_min = i.price;
                               QUERY PLAN
-----------------------------------------------------------------------
 Merge Join  (cost=491180.66..512965.72 rows=9237 width=58)
   Merge Cond: (("outer".price_min = "inner".price) AND
               ("outer"."?column8?" = "inner"."?column3?"))
   ->  Sort  (cost=361887.05..367000.05 rows=2045201 width=58)
         Sort Key: s.price_min, (s.isbn)::text
         ->  Seq Scan on summary s  (cost=0.00..44651.01 rows=2045201 width=58)
   ->  Sort  (cost=129293.61..131499.09 rows=882192 width=25)
         Sort Key: i.price, (i.isbn)::text
         ->  Seq Scan on inventory i  (cost=0.00..16173.92 rows=882192 width=25)
(8 rows)

bookshelf=> explain select s.* from summary s, inventory i where
s.price_min = i.price and s.isbn = i.isbn;
                               QUERY PLAN
-----------------------------------------------------------------------
 Merge Join  (cost=491180.66..512965.72 rows=9237 width=58)
   Merge Cond: (("outer".price_min = "inner".price) AND
               ("outer"."?column8?" = "inner"."?column3?"))
   ->  Sort  (cost=361887.05..367000.05 rows=2045201 width=58)
         Sort Key: s.price_min, (s.isbn)::text
         ->  Seq Scan on summary s  (cost=0.00..44651.01 rows=2045201 width=58)
   ->  Sort  (cost=129293.61..131499.09 rows=882192 width=25)
         Sort Key: i.price, (i.isbn)::text
         ->  Seq Scan on inventory i  (cost=0.00..16173.92 rows=882192 width=25)
(8 rows)

Here are the pg_stats rows (reformatted one column per line, to survive
the mail linewrap):

bookshelf=> select * from pg_stats where schemaname='de'
            and tablename='inventory' and attname='isbn';
 schemaname        | de
 tablename         | inventory
 attname           | isbn
 null_frac         | 0
 avg_width         | 14
 n_distinct        | -1
 most_common_vals  |
 most_common_freqs |
 histogram_bounds  | {0002551543,0198268211,0375507299,0486231305,
                      0673395197,0767901576,0810304430,0865738890,
                      0931595029,1574160052,9971504014}
 correlation       | 1

bookshelf=> select * from pg_stats where schemaname='de'
            and tablename='inventory' and attname='price';
 schemaname        | de
 tablename         | inventory
 attname           | price
 null_frac         | 0
 avg_width         | 11
 n_distinct        | 1628
 most_common_vals  | {59.95,0.00,54.88,53.30,60.50,64.25,73.63,49.39,
                      50.02,53.37}
 most_common_freqs | {0.259667,0.00633333,0.00533333,0.00466667,
                      0.00466667,0.00466667,0.00466667,0.00433333,
                      0.004,0.004}
 histogram_bounds  | {49.16,52.06,55.53,59.56,63.78,68.90,76.90,88.53,
                      106.16,143.75,1538.88}
 correlation       | 0.149342

bookshelf=> select * from pg_stats where schemaname='de'
            and tablename='summary' and attname='isbn';
 schemaname        | de
 tablename         | summary
 attname           | isbn
 null_frac         | 0
 avg_width         | 14
 n_distinct        | -1
 most_common_vals  |
 most_common_freqs |
 histogram_bounds  | {0001984209,020801912X,0395287693,055214911X,
                      0722525915,0787630896,0822218100,0883856263,
                      1413900275,1843910381,9999955045}
 correlation       | 1

bookshelf=> select * from pg_stats where schemaname='de'
            and tablename='summary' and attname='price_min';
 schemaname        | de
 tablename         | summary
 attname           | price_min
 null_frac         | 0
 avg_width         | 10
 n_distinct        | 1532
 most_common_vals  | {0.00,59.95,6.95,6.00,4.07,10.17,11.53,10.85,
                      4.75,8.81}
 most_common_freqs | {0.425333,0.029,0.0193333,0.00533333,0.00333333,
                      0.00333333,0.00333333,0.003,0.00266667,0.00266667}
 histogram_bounds  | {0.05,7.11,10.30,14.28,19.54,27.86,50.47,61.25,
                      76.44,104.79,744.73}
 correlation       | 0.0546667

> > The second question is: why, oh why does the update take such an
> > obscenely long time to complete?
>
> See above --- the problem is not within the plan, but must be sought
> elsewhere.

--
Mike Glover
Key ID BFD19F2C <mpg4@duluoz.net>
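[For future posts of wide rows like these, psql's expanded display mode
prints one column per line and avoids the mail-client linewrap entirely.
A sketch; `\x` is a standard psql metacommand:]

```sql
-- Toggle expanded display before running the wide query:
\x
select * from pg_stats where schemaname='de'
  and tablename='summary' and attname='price_min';
\x
```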
Mike Glover <mpg4@duluoz.net> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It seems unlikely that the raw row inserts and updating the single
>> index could be quite that slow --- perhaps you have a foreign key
>> or trigger performance problem?

> There are no foreign keys or triggers for either of the tables.

Hmph.  It's clear that it is the update overhead that's taking the time
(since you show 292 seconds actual time in the update's top plan node ---
that's the time to find the rows and compute their new values, and all
the rest of the elapsed 3162 sec has to be update overhead).  Maybe you
just have a slow disk.

Just out of curiosity, how much time does the update take if you don't
have any index on the summary table?  Try

	create temp table tsummary as select * from summary;
	vacuum analyze tsummary;
	explain analyze update tsummary set ... ;

>> A quick experiment shows that if the planner does not have any reason
>> to prefer one ordering over another, the current coding will put the
>> last WHERE clause first:
> [snip]
> It looks like the planner is already making a principled choice:

After a little bit of experimentation I was reminded that the planner
does account for the possibility that a merge join can stop short of
full execution when the first mergejoin columns have different data
ranges.  In this case it's preferring to put price first because there
is a greater discrepancy in the ranges of s.price_min and i.price than
there is in the ranges of the isbn columns.  I'm not sure that it's
wrong.  You could try increasing the statistics target on the price
columns (and re-ANALYZEing) to see if finer-grain data changes that
estimate at all.

In any case, the fact that the chosen plan doesn't make use of your
index on isbn doesn't mean that such a plan wasn't considered.  It was,
but this plan was estimated to be less expensive.  You could check out
alternative plans and see if the estimate is accurate by fooling with
enable_seqscan and enable_sort.
regards, tom lane
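[The two suggestions above can be sketched concretely. Column and table
names are from the thread; the statistics target of 100 is an arbitrary
example value (the 7.4 default is 10), and ALTER TABLE ... SET STATISTICS
plus the enable_* settings are standard PostgreSQL:]

```sql
-- Finer-grained statistics on the price columns, then re-ANALYZE:
ALTER TABLE de.summary   ALTER COLUMN price_min SET STATISTICS 100;
ALTER TABLE de.inventory ALTER COLUMN price     SET STATISTICS 100;
ANALYZE de.summary;
ANALYZE de.inventory;

-- Force the planner away from the seqscan/sort mergejoin for one
-- session, to compare the alternative plans' estimated vs. actual cost:
SET enable_seqscan = off;
EXPLAIN ANALYZE SELECT s.* FROM summary s, inventory i
 WHERE s.isbn = i.isbn AND s.price_min = i.price;
SET enable_seqscan = on;

SET enable_sort = off;
EXPLAIN ANALYZE SELECT s.* FROM summary s, inventory i
 WHERE s.isbn = i.isbn AND s.price_min = i.price;
SET enable_sort = on;
```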