Thread: suggestions on improving a query
Hi, I have a query that involves 3 tables. The query is:

  select pubchem_compound.openeye_can_smiles, pubchem_compound.nist_inchi, dock.cid, dockscore_plp.*
  from dock, dockscore_plp, pubchem_compound
  where dock.target = '1YC1' and
        dock.dockid = dockscore_plp.id and
        dock.cid = pubchem_compound.cid
  order by dockscore_plp.total limit 10;

The output of explain analyze is:

 Limit  (cost=0.00..387.36 rows=10 width=297) (actual time=242977.644..462748.215 rows=10 loops=1)
   ->  Nested Loop  (cost=0.00..37325186.12 rows=963575 width=297) (actual time=242977.638..462748.175 rows=10 loops=1)
         ->  Nested Loop  (cost=0.00..31523550.51 rows=963575 width=90) (actual time=242900.629..461810.902 rows=10 loops=1)
               ->  Index Scan using plp_total_idx on dockscore_plp  (cost=0.00..16733229.92 rows=4669988 width=80) (actual time=98.323..322537.605 rows=25197 loops=1)
               ->  Index Scan using dock_pkey on dock  (cost=0.00..3.15 rows=1 width=18) (actual time=5.521..5.521 rows=0 loops=25197)
                     Index Cond: (dock.dockid = "outer".id)
                     Filter: (target = '1YC1'::text)
         ->  Index Scan using pubchem_compound_pkey on pubchem_compound  (cost=0.00..6.01 rows=1 width=216) (actual time=93.699..93.704 rows=1 loops=10)
               Index Cond: (("outer".cid)::text = (pubchem_compound.cid)::text)
 Total runtime: 462748.439 ms
(10 rows)

Now, the tables 'dock' and 'dockscore_plp' have 4.6M rows and 'pubchem_compound' has 10M rows. However the clause:

  dock.target = '1YC1' and
  dock.dockid = dockscore_plp.id

reduces the number of rows from 4.6M to 96K. I had figured that after this the query would be very fast. But the explain analyze output seems to indicate that the dockscore_plp table is being sorted (using the index plp_total_idx) in its entirety, and that this is the bottleneck.

Is this a correct interpretation? What I expected was that the sort would be applied to the 96K-row subset of dockscore_plp rather than to the whole table. Is it possible to restructure the query so that the sort is done on 96K rows rather than 4.6M rows?

Thanks,

-------------------------------------------------------------------
Rajarshi Guha <rguha@indiana.edu>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
"I'd love to go out with you, but my favorite commercial is on TV."
This line:

  Index Scan using plp_total_idx on dockscore_plp
    (cost=0.00..16733229.92 rows=4669988 width=80)
    (actual time=98.323..322537.605 rows=25197 loops=1)

means the planner did what it did because it estimated there would be nearly 5 million rows. However, there were only 25,000.

Have these tables been vacuumed & analyzed recently? Your first step should be to vacuum & analyze them, and send us the new "explain analyze".

-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Rajarshi Guha
Sent: Monday, February 12, 2007 2:24 PM
To: pgsql-general@postgresql.org
Subject: [GENERAL] suggestions on improving a query
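[Editor's note: a minimal sketch of the maintenance Adam suggests, using the table names from the query above; ANALYZE alone would also refresh the planner statistics.]

  -- reclaim dead space and rebuild the planner statistics for the
  -- three tables involved in the query
  VACUUM ANALYZE dock;
  VACUUM ANALYZE dockscore_plp;
  VACUUM ANALYZE pubchem_compound;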
Rajarshi Guha <rguha@indiana.edu> writes:
> However the clause:
> dock.target = '1YC1' and
> dock.dockid = dockscore_plp.id
> reduces the number of rows from 4.6M to 96K.

The planner seems to be estimating about ten times that many. Perhaps increasing the statistics target for dock.target would help?

			regards, tom lane
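[Editor's note: for readers following along, raising the statistics target for one column and refreshing the stats looks roughly like this; the value 100 is only an illustrative choice (the default in this era of PostgreSQL was 10).]

  -- collect a larger statistics sample for dock.target
  ALTER TABLE dock ALTER COLUMN target SET STATISTICS 100;
  -- rebuild the statistics so the planner sees the new sample
  ANALYZE dock;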
"Adam Rich" <adam.r@sbcglobal.net> writes: > This line: > Index Scan using plp_total_idx on dockscore_plp > (cost=0.00..16733229.92 rows=4669988 width=80) > (actual time=98.323..322537.605 rows=25197 loops=1) > Means the planner did what it did, because it estimated there would be > nearly 5 million rows. However, there were only 25,000. No, you have to be careful about interpreting the numbers when underneath a Limit node. The rows estimate is an estimate of the total number of rows if the plan node were run to completion ... but if the Limit stops execution early, that's not what will happen. The actual rows count shows how many rows really got pulled from the node before the Limit stopped things. The real problem here is that the planner is guessing that it won't take very long to find 10 rows satisfying the target = '1YC1' condition while scanning in dockscore_plp.total order. So it chooses a plan that would have a long total runtime (notice the large cost estimates below the Limit) expecting that only a small fraction of that total will actually be expended. The expectation seems a bit off unfortunately :-(. I can't tell from the given data whether the problem is just an overestimate of the frequency of target = '1YC1', or if there's an additional effect. For example, if that target value tended to only be associated with larger values of dockscore_plp.total, then a plan like this could lose big-time because it will have to scan a long way to find those rows. regards, tom lane
On Tue, 2007-02-13 at 21:44 -0500, Tom Lane wrote:
> Rajarshi Guha <rguha@indiana.edu> writes:
> > However the clause:
> > dock.target = '1YC1' and
> > dock.dockid = dockscore_plp.id
> > reduces the number of rows from 4.6M to 96K.
>
> The planner seems to be estimating about ten times that many. Perhaps
> increasing the statistics target for dock.target would help?

My original message had a typo: I expected it to be ~960K, so Postgres is working as expected. However, increasing the statistics target for dock.target did lead to an improvement in performance.

Could this be because dock.target has only 5 unique values? So although the table has ~4.6M rows, each set of ~960K rows of dock.dockid is associated with a single value of dock.target.

Thanks,
-------------------------------------------------------------------
Rajarshi Guha <rguha@indiana.edu>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
All great ideas are controversial, or have been at one time.
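[Editor's note: as a sanity check on the "only 5 unique values" theory, the planner's view of the column can be inspected directly via the standard pg_stats system view; this query is only an illustrative sketch.]

  SELECT n_distinct, most_common_vals, most_common_freqs
  FROM pg_stats
  WHERE tablename = 'dock' AND attname = 'target';
  -- n_distinct should come out around 5, and most_common_freqs shows the
  -- fraction of the table the planner believes each target value covers.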
On Tue, 2007-02-13 at 22:04 -0500, Tom Lane wrote:
> "Adam Rich" <adam.r@sbcglobal.net> writes:
> > This line:
> >   Index Scan using plp_total_idx on dockscore_plp
> >     (cost=0.00..16733229.92 rows=4669988 width=80)
> >     (actual time=98.323..322537.605 rows=25197 loops=1)
> > Means the planner did what it did, because it estimated there would be
> > nearly 5 million rows. However, there were only 25,000.

Sorry for not doing the obvious beforehand! I increased the statistics target for some of the columns in some of the tables and then did a vacuum analyze. Rerunning the query gives:

                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..397.24 rows=10 width=268) (actual time=98322.597..171721.583 rows=10 loops=1)
   ->  Nested Loop  (cost=0.00..37182572.57 rows=936023 width=268) (actual time=98322.590..171721.543 rows=10 loops=1)
         ->  Nested Loop  (cost=0.00..31580822.05 rows=936023 width=90) (actual time=98236.963..171379.151 rows=10 loops=1)
               ->  Index Scan using plp_total_idx on dockscore_plp  (cost=0.00..16858401.83 rows=4669988 width=80) (actual time=54.989..102775.761 rows=25197 loops=1)
               ->  Index Scan using dock_pkey on dock  (cost=0.00..3.14 rows=1 width=18) (actual time=2.718..2.718 rows=0 loops=25197)
                     Index Cond: (dock.dockid = "outer".id)
                     Filter: (target = '1YC1'::text)
         ->  Index Scan using pubchem_compound_pkey on pubchem_compound  (cost=0.00..5.97 rows=1 width=187) (actual time=34.221..34.223 rows=1 loops=10)
               Index Cond: (("outer".cid)::text = (pubchem_compound.cid)::text)
 Total runtime: 171722.964 ms
(10 rows)

Clearly a big improvement in performance.

(One question not directly related to the problem: when looking at the output of explain analyze, I know that one is supposed to start at the bottom and move up. Does that mean that the index scan on pubchem_compound is being performed first? Or should I start from the innermost line?)

However, it seems that it could still be improved:

  ->  Index Scan using plp_total_idx on dockscore_plp  (cost=0.00..16858401.83 rows=4669988 width=80) (actual time=54.989..102775.761 rows=25197 loops=1)

It looks like there's a big mismatch between the expected and observed costs and times.

> The real problem here is that the planner is guessing that it won't take
> very long to find 10 rows satisfying the target = '1YC1' condition while
> scanning in dockscore_plp.total order. So it chooses a plan that would
> have a long total runtime (notice the large cost estimates below the
> Limit) expecting that only a small fraction of that total will actually
> be expended. The expectation seems a bit off unfortunately :-(.
> I can't tell from the given data whether the problem is just an
> overestimate of the frequency of target = '1YC1', or if there's an
> additional effect.

I think that increasing the statistics has improved that.

> For example, if that target value tended to only be
> associated with larger values of dockscore_plp.total, then a plan like
> this could lose big-time because it will have to scan a long way to find
> those rows.

This is not the case. The value '1YC1' will be associated with both high and low values of dockscore_plp.total.

What I would like my query to do is this:

1. From dock.target find all rows = '1YC1'
2. Using dock.dockid of these rows, get the corresponding rows in dockscore_plp
3. Using dock.cid from the rows in 2, get the corresponding rows in pubchem_compound
4. Sort and take the top 10 from step 2 (and the associated rows from step 3)

However, now that I have written this out, it seems that what I really want to do is:

1. From dock.target find all rows = '1YC1'
2. Using dock.dockid of these rows, get the corresponding rows in dockscore_plp
3. Sort and take the top 10
4. Get the corresponding rows from pubchem_compound via pubchem_compound.cid

The problem with this is that step 4 is represented by the dock.cid = pubchem_compound.cid clause. It seems that if I had the cid column in dockscore_plp, then I could do a sort+limit on dockscore_plp and then simply look up the corresponding (10) rows in pubchem_compound (rather than looking up 960K rows).

The downside to this is that there are 4 more tables like dockscore_plp, and I would have to add a cid column to each of them, which seems redundant. Is it useful to increase redundancy to improve performance?

Thanks for the pointers,
-------------------------------------------------------------------
Rajarshi Guha <rguha@indiana.edu>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
There's no problem so bad that you can't add some guilt to it to make it worse. -Calvin
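[Editor's note: a sketch of the rewrite described above, not taken from the thread. It pushes the sort+limit into a subquery over dock/dockscore_plp and only then joins the 10 surviving rows to pubchem_compound; whether the planner actually sorts only the ~960K matching rows still depends on the available indexes and statistics.]

  SELECT pc.openeye_can_smiles, pc.nist_inchi, top10.*
  FROM (
      -- find the 10 best-scoring rows for this target first
      SELECT dock.cid, dockscore_plp.*
      FROM dock
      JOIN dockscore_plp ON dock.dockid = dockscore_plp.id
      WHERE dock.target = '1YC1'
      ORDER BY dockscore_plp.total
      LIMIT 10
  ) AS top10
  -- then look up compound details for just those 10 rows
  JOIN pubchem_compound pc ON top10.cid = pc.cid
  ORDER BY top10.total;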
On Wed, Feb 14, 2007 at 08:22:42AM -0500, Rajarshi Guha wrote:
> (One question not directly related to the problem: when looking at the
> output of explain analyze, I know that one is supposed to start at the
> bottom and move up. Does that mean that the index scan on pubchem_compound
> is being performed first? Or should I start from the innermost line?)

There's no concept of nodes being executed before others. Each node is executed as needed. In the case of a nested loop like you have, it reads one tuple from the outer node (the first child) and then as many tuples from the inner node as necessary (an index scan may not return very many). In your case the outer node is another nested loop, which will in turn scan its inner and outer nodes as necessary.

The Limit at the top means that no more than that many tuples will be requested from the child node, so child nodes may be executed once, many times, or not at all.

There are some more comprehensive writeups around, but hopefully this gives you an idea.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
Rajarshi Guha <rguha@indiana.edu> writes:
> Clearly a big improvement in performance.

Huh? It looks like exactly the same plan as before. Any improvement you're seeing must be coming from cache effects.

> It looks like there's a big mismatch between the expected and observed costs and times.

Well, in the first place the estimated costs are not measured in milliseconds, and in the second place the estimated cost and rowcount are for execution of the plan node to completion, which is not happening here because of the Limit --- we'll stop the plan as soon as the top join node has produced 10 rows.

In fact I'd say the whole problem here is that the planner is being too optimistic about the benefits of a fast-start plan. For whatever reason (most likely, an unfavorable correlation between dock.target and dockscore_plp.total), the desired rows aren't uniformly scattered in the output of the join, and so it's taking longer than expected to find 10 of them.

			regards, tom lane
Martijn van Oosterhout <kleptog@svana.org> writes:
> There are some more comprehensive writeups around, but hopefully this
> gives you an idea.

You can find the official(tm) explanation at http://www.postgresql.org/docs/8.2/static/executor.html --- in fact, you might want to read all of chapter 42.

			regards, tom lane
On Wed, 2007-02-14 at 10:55 -0500, Tom Lane wrote:
> Rajarshi Guha <rguha@indiana.edu> writes:
> > Clearly a big improvement in performance.
>
> Huh? It looks like exactly the same plan as before. Any improvement
> you're seeing must be coming from cache effects.

Well, the new run was done nearly 8 hours after the initial one - I would've thought that the cache had been purged by then (?)

> > It looks like there's a big mismatch between the expected and observed costs and times.
>
> In fact I'd say the whole problem here
> is that the planner is being too optimistic about the benefits of a
> fast-start plan. For whatever reason (most likely, an unfavorable
> correlation between dock.target and dockscore_plp.total), the desired
> rows aren't uniformly scattered in the output of the join, and so it's
> taking longer than expected to find 10 of them.

Is there any way to solve this? I've increased the statistics target on dockscore_plp.total to 100 - does going higher help?

From what you've said, it appears that the problem arises from a lack of correlation between two columns in two different tables. This is strange since, out of the 4.6M rows in dock, ~960K will be selected, and the corresponding 960K rows from dockscore_plp will be ordered and then the top 10 taken. So does the lack of correlation arise from the physical 'ordering' of rows in the DB itself? And if this is the case, how does one fix the lack of correlation (if at all possible)?

-------------------------------------------------------------------
Rajarshi Guha <rguha@indiana.edu>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
Regular naps prevent old age.... especially if you take them while driving