Thread: sequence scan problem
I'm having a strange problem with a query. The query is fairly simple, with a few constants and two joins. All relevant columns should be indexed, and I'm pretty sure there aren't any type conversion issues. But the query plan includes a fairly heavy seq scan. The only possible complication is that the tables involved are fairly large - hundreds of millions of rows each.

Can anyone explain this? There should only ever be a maximum of about 50 rows returned when the query is executed.

Query:

    select fls.function_verified, fls.score, fls.go_category_group_ref,
           fs1.gene_ref, fs1.function_verified_exactly,
           fs2.gene_ref, fs2.function_verified_exactly
    from functional_linkage_scores fls,
         gene_prediction_view fs1,
         gene_prediction_view fs2
    where fls.gene_ref1 = fs1.gene_ref
      and fls.gene_ref2 = fs2.gene_ref
      and fs1.go_term_ref = 2
      and fs2.go_term_ref = 2

Explain on query:

 Merge Join  (cost=1331863800.16..6629339921.15 rows=352770803726 width=22)
   Merge Cond: (fs2.gene_ref = fls.gene_ref2)
   ->  Index Scan using gene_prediction_view_gene_ref on gene_prediction_view fs2  (cost=0.00..6235287.98 rows=197899 width=5)
         Index Cond: (go_term_ref = 2)
   ->  Materialize  (cost=1331794730.41..1416453931.72 rows=6772736105 width=21)
         ->  Sort  (cost=1331794730.41..1348726570.67 rows=6772736105 width=21)
               Sort Key: fls.gene_ref2
               ->  Merge Join  (cost=38762951.04..146537410.33 rows=6772736105 width=21)
                     Merge Cond: (fs1.gene_ref = fls.gene_ref1)
                     ->  Index Scan using gene_prediction_view_gene_ref on gene_prediction_view fs1  (cost=0.00..6235287.98 rows=197899 width=5)
                           Index Cond: (go_term_ref = 2)
                     ->  Materialize  (cost=38713921.60..41618494.20 rows=232365808 width=20)
                           ->  Sort  (cost=38713921.60..39294836.12 rows=232365808 width=20)
                                 Sort Key: fls.gene_ref1
                                 ->  Seq Scan on functional_linkage_scores fls  (cost=0.00..3928457.08 rows=232365808 width=20)

\d on functional_linkage_scores (232241678 rows):

            Table "public.functional_linkage_scores"
        Column         |     Type      |                                Modifiers
-----------------------+---------------+------------------------------------------------------------------------
 id                    | integer       | not null default nextval('functional_linkage_scores_id_seq'::regclass)
 gene_ref1             | integer       | not null
 gene_ref2             | integer       | not null
 function_verified     | boolean       | not null
 score                 | numeric(12,4) | not null
 go_category_group_ref | integer       | not null
 go_term_ref           | integer       |
Indexes:
    "functional_linkage_scores_pkey" PRIMARY KEY, btree (id)
    "functional_linkage_scores_gene_ref1_key" UNIQUE, btree (gene_ref1, gene_ref2, go_category_group_ref, go_term_ref)
    "ix_functional_linkage_scores_gene_ref2" btree (gene_ref2)
Foreign-key constraints:
    "functional_linkage_scores_gene_ref1_fkey" FOREIGN KEY (gene_ref1) REFERENCES genes(id)
    "functional_linkage_scores_gene_ref2_fkey" FOREIGN KEY (gene_ref2) REFERENCES genes(id)
    "functional_linkage_scores_go_category_group_ref_fkey" FOREIGN KEY (go_category_group_ref) REFERENCES go_category_groups(id)

\d on gene_prediction_view (568654245 rows):

                 Table "public.gene_prediction_view"
              Column               |          Type          |                             Modifiers
-----------------------------------+------------------------+--------------------------------------------------------------------
 id                                | integer                | not null default nextval('gene_prediction_view_id_seq'::regclass)
 gene_ref                          | integer                | not null
 go_term_ref                       | integer                | not null
 go_description                    | character varying(200) | not null
 go_category                       | character varying(50)  | not null
 function_verified_exactly         | boolean                | not null
 function_verified_with_parent_go  | boolean                | not null
 score                             | numeric(12,4)          | not null
 prediction_method_ref             | integer                |
 functional_score_ref              | integer                |
Indexes:
    "gene_prediction_view_pkey" PRIMARY KEY, btree (id)
    "gene_prediction_view_functional_score_ref_key" UNIQUE, btree (functional_score_ref)
    "gene_prediction_view_gene_ref" UNIQUE, btree (gene_ref, go_term_ref, prediction_method_ref)
Foreign-key constraints:
    "gene_prediction_view_functional_score_ref_fkey" FOREIGN KEY (functional_score_ref) REFERENCES functional_scores(id)
    "gene_prediction_view_gene_ref_fkey" FOREIGN KEY (gene_ref) REFERENCES genes(id)
    "gene_prediction_view_go_term_ref_fkey" FOREIGN KEY (go_term_ref) REFERENCES go_terms(term)

...and just in case someone can give advice on more aggressive settings that might help out the planner for this particular computer...

This computer: Mac Pro / 4 gigs ram / software Raid 0 across two hard drives.
Production computer: Xeon 3ghz / 32 gigs ram / Debian
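[Editor's aside, not part of the original message: for a 4 GB machine like the one described above, session-level planner settings are the usual first knobs to try. The specific values below are illustrative assumptions, not recommendations from this thread.]

```sql
-- Illustrative session settings for a 4 GB RAM / RAID 0 machine (values are guesses):
SET work_mem = '256MB';            -- let large sorts/hashes use more memory before spilling to disk
SET random_page_cost = 2.0;        -- striped disks narrow the random-vs-sequential I/O gap
SET effective_cache_size = '3GB';  -- hint at how much of the database the OS cache can hold
-- then re-run EXPLAIN on the query above to see whether the plan changes
```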
John Beaver wrote:
> I'm having a strange problem with a query. The query is fairly simple,
> with a few constants and two joins. All relevant columns should be
> indexed, and I'm pretty sure there aren't any type conversion issues.
> But the query plan includes a fairly heavy seq scan. The only possible
> complication is that the tables involved are fairly large - hundreds of
> millions of rows each.
>
> Can anyone explain this? There should only ever be a maximum of about 50
> rows returned when the query is executed.

You didn't say when you last vacuumed?
If there should only be 50 rows returned then the estimates from the
planner are way out.

If that doesn't help, we'll need version info, and (if you can afford
the time) an "explain analyze"

Cheers,
  Jeremy
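[Editor's aside: a minimal sketch of the diagnostics Jeremy is asking for, using the table names from the original post.]

```sql
-- Refresh statistics, then capture actual vs. estimated row counts:
VACUUM ANALYZE functional_linkage_scores;
VACUUM ANALYZE gene_prediction_view;

EXPLAIN ANALYZE
SELECT fls.function_verified, fls.score, fls.go_category_group_ref,
       fs1.gene_ref, fs1.function_verified_exactly,
       fs2.gene_ref, fs2.function_verified_exactly
FROM functional_linkage_scores fls, gene_prediction_view fs1, gene_prediction_view fs2
WHERE fls.gene_ref1 = fs1.gene_ref
  AND fls.gene_ref2 = fs2.gene_ref
  AND fs1.go_term_ref = 2
  AND fs2.go_term_ref = 2;
```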
Jeremy Harris wrote:
> John Beaver wrote:
>> I'm having a strange problem with a query. The query is fairly
>> simple, with a few constants and two joins. All relevant columns
>> should be indexed, and I'm pretty sure there aren't any type
>> conversion issues. But the query plan includes a fairly heavy seq
>> scan. The only possible complication is that the tables involved are
>> fairly large - hundreds of millions of rows each.
>>
>> Can anyone explain this? There should only ever be a maximum of about
>> 50 rows returned when the query is executed.
>
> You didn't say when you last vacuumed?

I ran 'vacuum analyze' on both tables directly after I finished building
them, and I haven't updated their contents since.

> If there should only be 50 rows returned then the estimates from the
> planner are way out.
>
> If that doesn't help, we'll need version info, and (if you can afford
> the time) an "explain analyze"

Sure, I'm running it now. I'll send the results when it's done, but yes,
it could take a while.

> Cheers,
>   Jeremy
Oh, and the version is 8.3.3.

Jeremy Harris wrote:
> John Beaver wrote:
>> I'm having a strange problem with a query. The query is fairly
>> simple, with a few constants and two joins. All relevant columns
>> should be indexed, and I'm pretty sure there aren't any type
>> conversion issues. But the query plan includes a fairly heavy seq
>> scan. The only possible complication is that the tables involved are
>> fairly large - hundreds of millions of rows each.
>>
>> Can anyone explain this? There should only ever be a maximum of about
>> 50 rows returned when the query is executed.
>
> You didn't say when you last vacuumed?
> If there should only be 50 rows returned then the estimates from the
> planner are way out.
>
> If that doesn't help, we'll need version info, and (if you can afford
> the time) an "explain analyze"
>
> Cheers,
>   Jeremy
John Beaver <john.e.beaver@gmail.com> writes:
> Can anyone explain this? There should only ever be a maximum of about 50
> rows returned when the query is executed.

Is the estimate that 197899 rows of gene_prediction_view have
go_term_ref = 2 about right? If not, then we need to talk about fixing
your statistics.

If it is in the right ballpark then I do not see *any* plan for this
query that runs in small time. The only way to avoid a seqscan on
functional_linkage_scores would be to do 198K^2 index probes into it,
one for each combination of matching fs1 and fs2 rows; I can guarantee
you that that's not a win.

The fact that the planner is estimating 352770803726 result rows
compared to your estimate of 50 offers some hope that it's a stats
problem, but ...

			regards, tom lane
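[Editor's aside: a sketch of how one might answer Tom's question and, if the estimate is off, raise the per-column statistics target. The target value 1000 is an illustrative assumption.]

```sql
-- Compare the planner's estimate (197899) against the true count:
SELECT count(*) FROM gene_prediction_view WHERE go_term_ref = 2;

-- If the two diverge badly, collect finer-grained statistics for that column:
ALTER TABLE gene_prediction_view ALTER COLUMN go_term_ref SET STATISTICS 1000;
ANALYZE gene_prediction_view;
```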
Ok, here's the explain analyze result. Again, this is Postgres 8.3.3 and
I vacuumed-analyzed both tables directly after they were created.

# explain analyze select fls.function_verified, fls.score, fls.go_category_group_ref, fs1.gene_ref, fs1.function_verified_exactly, fs2.gene_ref, fs2.function_verified_exactly from functional_linkage_scores fls, gene_prediction_view fs1, gene_prediction_view fs2 where fls.gene_ref1 = fs1.gene_ref and fls.gene_ref2 = fs2.gene_ref and fs1.go_term_ref = 2 and fs2.go_term_ref = 2;

                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=1399203593.41..6702491234.74 rows=352770803726 width=22) (actual time=6370194.467..22991303.434 rows=15610535128 loops=1)
   Merge Cond: (fs2.gene_ref = fls.gene_ref2)
   ->  Index Scan using gene_prediction_view_gene_ref on gene_prediction_view fs2  (cost=0.00..12111899.77 rows=197899 width=5) (actual time=29.592..469838.583 rows=180629 loops=1)
         Index Cond: (go_term_ref = 2)
   ->  Materialize  (cost=1399069432.20..1483728633.52 rows=6772736105 width=21) (actual time=6370164.864..16623552.417 rows=15610535121 loops=1)
         ->  Sort  (cost=1399069432.20..1416001272.47 rows=6772736105 width=21) (actual time=6370164.860..13081970.248 rows=1897946790 loops=1)
               Sort Key: fls.gene_ref2
               Sort Method: external merge  Disk: 61192240kB
               ->  Merge Join  (cost=40681244.97..154286110.62 rows=6772736105 width=21) (actual time=592112.778..2043161.851 rows=1897946790 loops=1)
                     Merge Cond: (fs1.gene_ref = fls.gene_ref1)
                     ->  Index Scan using gene_prediction_view_gene_ref on gene_prediction_view fs1  (cost=0.00..12111899.77 rows=197899 width=5) (actual time=0.015..246613.129 rows=180644 loops=1)
                           Index Cond: (go_term_ref = 2)
                     ->  Materialize  (cost=40586010.10..43490582.70 rows=232365808 width=20) (actual time=592112.755..1121366.375 rows=1897946783 loops=1)
                           ->  Sort  (cost=40586010.10..41166924.62 rows=232365808 width=20) (actual time=592112.721..870349.308 rows=232241678 loops=1)
                                 Sort Key: fls.gene_ref1
                                 Sort Method: external merge  Disk: 7260856kB
                                 ->  Seq Scan on functional_linkage_scores fls  (cost=0.00..3928457.08 rows=232365808 width=20) (actual time=14.221..86455.902 rows=232241678 loops=1)
 Total runtime: 24183346.271 ms
(18 rows)

Jeremy Harris wrote:
> John Beaver wrote:
>> I'm having a strange problem with a query. The query is fairly
>> simple, with a few constants and two joins. All relevant columns
>> should be indexed, and I'm pretty sure there aren't any type
>> conversion issues. But the query plan includes a fairly heavy seq
>> scan. The only possible complication is that the tables involved are
>> fairly large - hundreds of millions of rows each.
>>
>> Can anyone explain this? There should only ever be a maximum of about
>> 50 rows returned when the query is executed.
>
> You didn't say when you last vacuumed?
> If there should only be 50 rows returned then the estimates from the
> planner are way out.
>
> If that doesn't help, we'll need version info, and (if you can afford
> the time) an "explain analyze"
>
> Cheers,
>   Jeremy
John Beaver <john.e.beaver@gmail.com> writes:
> Ok, here's the explain analyze result. Again, this is Postgres 8.3.3 and
> I vacuumed-analyzed both tables directly after they were created.

>  Merge Join  (cost=1399203593.41..6702491234.74 rows=352770803726
>  width=22) (actual time=6370194.467..22991303.434 rows=15610535128 loops=1)
                                                    ^^^^^^^^^^^

Weren't you saying that only 50 rows should be returned? I'm thinking
the real problem here is pilot error: you missed out a needed join
condition or something. SQL will happily execute underconstrained
queries ...

			regards, tom lane
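[Editor's aside, illustrating Tom's point: the thread never identifies which condition was missing, so the extra predicate below is purely hypothetical. Because functional_linkage_scores also has a go_term_ref column, one plausible fix would constrain it as well, collapsing the near-Cartesian join.]

```sql
-- Hypothetical repair of the underconstrained query: the added fls.go_term_ref
-- predicate is an assumption, shown only to demonstrate how one missing join
-- condition can inflate a 50-row result into billions of rows.
SELECT fls.function_verified, fls.score, fls.go_category_group_ref,
       fs1.gene_ref, fs1.function_verified_exactly,
       fs2.gene_ref, fs2.function_verified_exactly
FROM functional_linkage_scores fls, gene_prediction_view fs1, gene_prediction_view fs2
WHERE fls.gene_ref1 = fs1.gene_ref
  AND fls.gene_ref2 = fs2.gene_ref
  AND fs1.go_term_ref = 2
  AND fs2.go_term_ref = 2
  AND fls.go_term_ref = 2;  -- hypothetical missing constraint
```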
<chuckle> You're right - for some reason I was looking at the "(18 rows)"
at the bottom. Pilot error indeed - I'll have to figure out what's going
on with my data.

Thanks!

Tom Lane wrote:
> John Beaver <john.e.beaver@gmail.com> writes:
>> Ok, here's the explain analyze result. Again, this is Postgres 8.3.3 and
>> I vacuumed-analyzed both tables directly after they were created.
>
>>  Merge Join  (cost=1399203593.41..6702491234.74 rows=352770803726
>>  width=22) (actual time=6370194.467..22991303.434 rows=15610535128 loops=1)
>                                                     ^^^^^^^^^^^
> Weren't you saying that only 50 rows should be returned? I'm thinking
> the real problem here is pilot error: you missed out a needed join
> condition or something. SQL will happily execute underconstrained
> queries ...
>
> 			regards, tom lane