Thread: sequence scan problem

sequence scan problem

From: John Beaver

I'm having a strange problem with a query. The query is fairly simple,
with a few constants and two joins. All relevant columns should be
indexed, and I'm pretty sure there aren't any type conversion issues.
But the query plan includes a fairly heavy seq scan. The only possible
complication is that the tables involved are fairly large - hundreds of
millions of rows each.

Can anyone explain this? There should only ever be a maximum of about 50
rows returned when the query is executed.

Query:

select fls.function_verified, fls.score, fls.go_category_group_ref,
       fs1.gene_ref, fs1.function_verified_exactly,
       fs2.gene_ref, fs2.function_verified_exactly
from functional_linkage_scores fls,
     gene_prediction_view fs1,
     gene_prediction_view fs2
where fls.gene_ref1 = fs1.gene_ref
  and fls.gene_ref2 = fs2.gene_ref
  and fs1.go_term_ref = 2
  and fs2.go_term_ref = 2

Explain on query:
 Merge Join  (cost=1331863800.16..6629339921.15 rows=352770803726 width=22)
   Merge Cond: (fs2.gene_ref = fls.gene_ref2)
   ->  Index Scan using gene_prediction_view_gene_ref on gene_prediction_view fs2  (cost=0.00..6235287.98 rows=197899 width=5)
         Index Cond: (go_term_ref = 2)
   ->  Materialize  (cost=1331794730.41..1416453931.72 rows=6772736105 width=21)
         ->  Sort  (cost=1331794730.41..1348726570.67 rows=6772736105 width=21)
               Sort Key: fls.gene_ref2
               ->  Merge Join  (cost=38762951.04..146537410.33 rows=6772736105 width=21)
                     Merge Cond: (fs1.gene_ref = fls.gene_ref1)
                     ->  Index Scan using gene_prediction_view_gene_ref on gene_prediction_view fs1  (cost=0.00..6235287.98 rows=197899 width=5)
                           Index Cond: (go_term_ref = 2)
                     ->  Materialize  (cost=38713921.60..41618494.20 rows=232365808 width=20)
                           ->  Sort  (cost=38713921.60..39294836.12 rows=232365808 width=20)
                                 Sort Key: fls.gene_ref1
                                 ->  Seq Scan on functional_linkage_scores fls  (cost=0.00..3928457.08 rows=232365808 width=20)


\d on functional_linkage_scores (232241678 rows):

                             Table "public.functional_linkage_scores"
        Column         |     Type      |                        Modifiers
-----------------------+---------------+------------------------------------------------------------------------
 id                    | integer       | not null default nextval('functional_linkage_scores_id_seq'::regclass)
 gene_ref1             | integer       | not null
 gene_ref2             | integer       | not null
 function_verified     | boolean       | not null
 score                 | numeric(12,4) | not null
 go_category_group_ref | integer       | not null
 go_term_ref           | integer       |
Indexes:
    "functional_linkage_scores_pkey" PRIMARY KEY, btree (id)
    "functional_linkage_scores_gene_ref1_key" UNIQUE, btree (gene_ref1, gene_ref2, go_category_group_ref, go_term_ref)
    "ix_functional_linkage_scores_gene_ref2" btree (gene_ref2)
Foreign-key constraints:
    "functional_linkage_scores_gene_ref1_fkey" FOREIGN KEY (gene_ref1) REFERENCES genes(id)
    "functional_linkage_scores_gene_ref2_fkey" FOREIGN KEY (gene_ref2) REFERENCES genes(id)
    "functional_linkage_scores_go_category_group_ref_fkey" FOREIGN KEY (go_category_group_ref) REFERENCES go_category_groups(id)

\d on gene_prediction_view (568654245 rows):

                                   Table "public.gene_prediction_view"
              Column              |          Type          |                             Modifiers
----------------------------------+------------------------+-------------------------------------------------------------------
 id                               | integer                | not null default nextval('gene_prediction_view_id_seq'::regclass)
 gene_ref                         | integer                | not null
 go_term_ref                      | integer                | not null
 go_description                   | character varying(200) | not null
 go_category                      | character varying(50)  | not null
 function_verified_exactly        | boolean                | not null
 function_verified_with_parent_go | boolean                | not null
 score                            | numeric(12,4)          | not null
 prediction_method_ref            | integer                |
 functional_score_ref             | integer                |
Indexes:
    "gene_prediction_view_pkey" PRIMARY KEY, btree (id)
    "gene_prediction_view_functional_score_ref_key" UNIQUE, btree (functional_score_ref)
    "gene_prediction_view_gene_ref" UNIQUE, btree (gene_ref, go_term_ref, prediction_method_ref)
Foreign-key constraints:
    "gene_prediction_view_functional_score_ref_fkey" FOREIGN KEY (functional_score_ref) REFERENCES functional_scores(id)
    "gene_prediction_view_gene_ref_fkey" FOREIGN KEY (gene_ref) REFERENCES genes(id)
    "gene_prediction_view_go_term_ref_fkey" FOREIGN KEY (go_term_ref) REFERENCES go_terms(term)

...and just in case someone can give advice on more aggressive settings
that might help out the planner for this particular computer...
This computer: Mac Pro / 4 gigs ram / software Raid 0 across two hard
drives.
Production computer: Xeon 3ghz / 32 gigs ram / Debian
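
(The sort of session-level settings I have in mind - the values below are
only guesses for the 4-gig machine, nothing I've validated:)

    set work_mem = '256MB';             -- more room for in-memory sorts and hashes
    set effective_cache_size = '3GB';   -- how much OS cache the planner should assume
    set random_page_cost = 3.0;         -- default is 4.0; lower favors index scans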


Re: sequence scan problem

From: Jeremy Harris

John Beaver wrote:
> I'm having a strange problem with a query. The query is fairly simple,
> with a few constants and two joins. All relevant columns should be
> indexed, and I'm pretty sure there aren't any type conversion issues.
> But the query plan includes a fairly heavy seq scan. The only possible
> complication is that the tables involved are fairly large - hundreds of
> millions of rows each.
>
> Can anyone explain this? There should only ever be a maximum of about 50
> rows returned when the query is executed.

You didn't say when you last vacuumed?
If there should only be 50 rows returned then the estimates from the
planner are way out.

If that doesn't help, we'll need version info, and (if you can afford
the time) an "explain analyze"
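
Something like this, run from psql, would cover all of it (substitute your
actual table names and query):

    vacuum analyze functional_linkage_scores;
    vacuum analyze gene_prediction_view;
    select version();
    explain analyze <the full query, unchanged>;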

Cheers,
  Jeremy

Re: sequence scan problem

From: John Beaver

Jeremy Harris wrote:
> John Beaver wrote:
>> I'm having a strange problem with a query. The query is fairly
>> simple, with a few constants and two joins. All relevant columns
>> should be indexed, and I'm pretty sure there aren't any type
>> conversion issues. But the query plan includes a fairly heavy seq
>> scan. The only possible complication is that the tables involved are
>> fairly large - hundreds of millions of rows each.
>>
>> Can anyone explain this? There should only ever be a maximum of about
>> 50 rows returned when the query is executed.
>
> You didn't say when you last vacuumed?
I ran 'vacuum analyze' on both tables directly after I finished building
them, and I haven't updated their contents since.
> If there should only be 50 rows returned then the estimates from the
> planner are way out.
>
> If that doesn't help, we'll need version info, and (if you can afford
> the time) an "explain analyze"
Sure, I'm running it now. I'll send the results when it's done, but yes,
it could take a while.
>
> Cheers,
>  Jeremy
>

Re: sequence scan problem

From: John Beaver

Oh, and the version is 8.3.3.

Jeremy Harris wrote:
> John Beaver wrote:
>> I'm having a strange problem with a query. The query is fairly
>> simple, with a few constants and two joins. All relevant columns
>> should be indexed, and I'm pretty sure there aren't any type
>> conversion issues. But the query plan includes a fairly heavy seq
>> scan. The only possible complication is that the tables involved are
>> fairly large - hundreds of millions of rows each.
>>
>> Can anyone explain this? There should only ever be a maximum of about
>> 50 rows returned when the query is executed.
>
> You didn't say when you last vacuumed?
> If there should only be 50 rows returned then the estimates from the
> planner are way out.
>
> If that doesn't help, we'll need version info, and (if you can afford
> the time) an "explain analyze"
>
> Cheers,
>  Jeremy
>

Re: sequence scan problem

From: Tom Lane

John Beaver <john.e.beaver@gmail.com> writes:
> Can anyone explain this? There should only ever be a maximum of about 50
> rows returned when the query is executed.

Is the estimate that 197899 rows of gene_prediction_view have
go_term_ref = 2 about right?  If not, then we need to talk about
fixing your statistics.  If it is in the right ballpark then I do
not see *any* plan for this query that runs in small time.
The only way to avoid a seqscan on functional_linkage_scores would
be to do 198K^2 index probes into it, one for each combination of
matching fs1 and fs2 rows; I can guarantee you that that's not a win.

The fact that the planner is estimating 352770803726 result rows
compared to your estimate of 50 offers some hope that it's a stats
problem, but ...
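
If you want to check that, something along these lines would do it (the
statistics target of 1000 is just an example value):

    -- does the planner's 197899 estimate match reality?
    select count(*) from gene_prediction_view where go_term_ref = 2;

    -- if it's far off, raise the stats target on that column and re-analyze
    alter table gene_prediction_view
        alter column go_term_ref set statistics 1000;
    analyze gene_prediction_view;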

            regards, tom lane

Re: sequence scan problem

From: John Beaver

Ok, here's the explain analyze result. Again, this is Postgres 8.3.3 and
I vacuumed-analyzed both tables directly after they were created.


# explain analyze
select fls.function_verified, fls.score, fls.go_category_group_ref,
       fs1.gene_ref, fs1.function_verified_exactly,
       fs2.gene_ref, fs2.function_verified_exactly
from functional_linkage_scores fls,
     gene_prediction_view fs1,
     gene_prediction_view fs2
where fls.gene_ref1 = fs1.gene_ref
  and fls.gene_ref2 = fs2.gene_ref
  and fs1.go_term_ref = 2
  and fs2.go_term_ref = 2;

                                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=1399203593.41..6702491234.74 rows=352770803726 width=22) (actual time=6370194.467..22991303.434 rows=15610535128 loops=1)
   Merge Cond: (fs2.gene_ref = fls.gene_ref2)
   ->  Index Scan using gene_prediction_view_gene_ref on gene_prediction_view fs2  (cost=0.00..12111899.77 rows=197899 width=5) (actual time=29.592..469838.583 rows=180629 loops=1)
         Index Cond: (go_term_ref = 2)
   ->  Materialize  (cost=1399069432.20..1483728633.52 rows=6772736105 width=21) (actual time=6370164.864..16623552.417 rows=15610535121 loops=1)
         ->  Sort  (cost=1399069432.20..1416001272.47 rows=6772736105 width=21) (actual time=6370164.860..13081970.248 rows=1897946790 loops=1)
               Sort Key: fls.gene_ref2
               Sort Method:  external merge  Disk: 61192240kB
               ->  Merge Join  (cost=40681244.97..154286110.62 rows=6772736105 width=21) (actual time=592112.778..2043161.851 rows=1897946790 loops=1)
                     Merge Cond: (fs1.gene_ref = fls.gene_ref1)
                     ->  Index Scan using gene_prediction_view_gene_ref on gene_prediction_view fs1  (cost=0.00..12111899.77 rows=197899 width=5) (actual time=0.015..246613.129 rows=180644 loops=1)
                           Index Cond: (go_term_ref = 2)
                     ->  Materialize  (cost=40586010.10..43490582.70 rows=232365808 width=20) (actual time=592112.755..1121366.375 rows=1897946783 loops=1)
                           ->  Sort  (cost=40586010.10..41166924.62 rows=232365808 width=20) (actual time=592112.721..870349.308 rows=232241678 loops=1)
                                 Sort Key: fls.gene_ref1
                                 Sort Method:  external merge  Disk: 7260856kB
                                 ->  Seq Scan on functional_linkage_scores fls  (cost=0.00..3928457.08 rows=232365808 width=20) (actual time=14.221..86455.902 rows=232241678 loops=1)
 Total runtime: 24183346.271 ms
(18 rows)





Jeremy Harris wrote:
> John Beaver wrote:
>> I'm having a strange problem with a query. The query is fairly
>> simple, with a few constants and two joins. All relevant columns
>> should be indexed, and I'm pretty sure there aren't any type
>> conversion issues. But the query plan includes a fairly heavy seq
>> scan. The only possible complication is that the tables involved are
>> fairly large - hundreds of millions of rows each.
>>
>> Can anyone explain this? There should only ever be a maximum of about
>> 50 rows returned when the query is executed.
>
> You didn't say when you last vacuumed?
> If there should only be 50 rows returned then the estimates from the
> planner are way out.
>
> If that doesn't help, we'll need version info, and (if you can afford
> the time) an "explain analyze"
>
> Cheers,
>  Jeremy
>

Re: sequence scan problem

From: Tom Lane

John Beaver <john.e.beaver@gmail.com> writes:
> Ok, here's the explain analyze result. Again, this is Postgres 8.3.3 and
> I vacuumed-analyzed both tables directly after they were created.

>  Merge Join  (cost=1399203593.41..6702491234.74 rows=352770803726
> width=22) (actual time=6370194.467..22991303.434 rows=15610535128 loops=1)
                                                        ^^^^^^^^^^^

Weren't you saying that only 50 rows should be returned?  I'm thinking
the real problem here is pilot error: you missed out a needed join
condition or something.  SQL will happily execute underconstrained
queries ...
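
For instance, a quick way to see the fan-out that's multiplying your row
counts (just a diagnostic sketch, using the column names from your \d output):

    -- how many rows share each join key on the two sides?
    select gene_ref1, count(*)
    from functional_linkage_scores
    group by gene_ref1
    order by count(*) desc limit 10;

    select gene_ref, count(*)
    from gene_prediction_view
    where go_term_ref = 2
    group by gene_ref
    order by count(*) desc limit 10;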

            regards, tom lane

Re: sequence scan problem

From: John Beaver

<chuckle> You're right - for some reason I was looking at the "(18 rows)"
at the bottom. Pilot error indeed - I'll have to figure out what's going
on with my data.

Thanks!

Tom Lane wrote:
> John Beaver <john.e.beaver@gmail.com> writes:
>> Ok, here's the explain analyze result. Again, this is Postgres 8.3.3 and
>> I vacuumed-analyzed both tables directly after they were created.
>
>>  Merge Join  (cost=1399203593.41..6702491234.74 rows=352770803726
>> width=22) (actual time=6370194.467..22991303.434 rows=15610535128 loops=1)
>                                                          ^^^^^^^^^^^
>
> Weren't you saying that only 50 rows should be returned?  I'm thinking
> the real problem here is pilot error: you missed out a needed join
> condition or something.  SQL will happily execute underconstrained
> queries ...
>
>             regards, tom lane