Re: Inheritance & Indexes - Mailing list pgsql-general

From Alan Williams
Subject Re: Inheritance & Indexes
Date
Msg-id Pine.LNX.4.44.0306241119240.21113-100000@anzio.staff.emeryville.affymetrix.com
Whole thread Raw
In response to Re: Inheritance & Indexes  (Stephan Szabo <sszabo@megazone23.bigpanda.com>)
Responses Re: Inheritance & Indexes  (Stephan Szabo <sszabo@megazone23.bigpanda.com>)
List pgsql-general
On Tue, 24 Jun 2003, Stephan Szabo wrote:
> >  hs.exon.2=> explain select * from ga_psr_transcript_1 t,
> ga_psr_exon_1e where e.parent = t.id;
> >                                                      QUERY PLAN
> >
> ------------------------------------------------------------------------
> --------------------------------------------
> >  Merge Join  (cost=0.00..9087.71 rows=176908 width=98)
> >    Merge Cond: ("outer".id = "inner".parent)
> >    ->  Index Scan using ga_psr_transcript_1_pkey on
> ga_psr_transcript_1 t  (cost=0.00..1066.17 rows=43398 width=47)
> >    ->  Index Scan using ga_psr_exon_1_parent on ga_psr_exon_1 e
> (cost=0.00..5259.52 rows=176908 width=51)
> > (4 rows)
> >
> > If I do a join on the parent table, the optimizer refuses to use the
> > indicies:
> >
> > hs.exon.2=> explain select * from ga_psr_transcript t, ga_psr_exon e
> where e.parent = t.id;
>
> In this case, you can't use a single index scan to get the rows in order
> so the part that makes the above a nice plan doesn't really apply.  If
> you're getting all the rows and sorting them, index scans are probably a
> waste of time unless you have alot of dead space.  If we supported
> multi-table indexes, that'd potentially let you get a plan like
> the above.

Because of the foreign key constraint, the database engine could do
the above query on each of the child tables and concatenate the
results. This is because there is a notion in our schema of "paired"
inheritance where both the ga_psr_exon ahd ga_psr_transcript tables
are subclassed by choromosome. IE:

   ga_psr_exon_1.parent --> ga_psr_transcript_1.id

Of course the foreign key is the only
indication of this and I can't say that I'm entirely surprised that
the optimizer doesn't catch this.

This constraint unfortunately breaks down when joining on a non-foreign
key, such as range queries (the rows in these tables represent ranges
in a one dimensional space) like:

 explain select * from ga_psr_exon_1 e1, ga_psr_exon_1 e2 where
 e1.start >= e2.start and e1.start<=e2.stop and e1.stop >= e2.start and
 e1.stop <= e2.stop;

 Nested Loop  (cost=0.00..995313942.65 rows=386375808 width=102)
   ->  Seq Scan on ga_psr_exon_1 e1  (cost=0.00..3691.08 rows=176908 width=51)
   ->  Index Scan using ga_psr_exon_1_start_stop on ga_psr_exon_1 e2  (cost=0.00..5582.47 rows=2184 width=51)
         Index Cond: (("outer"."start" >= e2."start") AND ("outer".stop >= e2."start") AND ("outer"."start" <= e2.stop)
AND("outer".stop <= e2.stop)) 

versus

 explain select * from ga_psr_exon e1, ga_psr_exon e2 where
 e1.start >= e2.start and e1.start<=e2.stop and e1.stop >= e2.start and
 e1.stop <= e2.stop;

which results in a nested loop of seq scans. (It is actually worse
than this as we would really like to query for overlapping ranges, not
containment.)

Currently we use either a perl middleware or UNIONs to explicitly force
these paired table relationships.

>
> >
> ------------------------------------------------------------------------
> -----------------------
> >  Merge Join  (cost=1239155.37..70188119.40 rows=5514877218 width=334)
> >    Merge Cond: ("outer".id = "inner".parent)
> >    ->  Sort  (cost=243481.37..244816.14 rows=533908 width=165)
> >          Sort Key: t.id
> >          ->  Append  (cost=0.00..10980.08 rows=533908 width=165)
> [lots of seqscans snipped]
> >    ->  Sort  (cost=995674.00..1000838.64 rows=2065853 width=169)
> >          Sort Key: e.parent
> >          ->  Append  (cost=0.00..43563.52 rows=2065853 width=169)
> [more seqscans snipped]
>
> > Same thing even if I'm querying for a specific tuple:
> >
> > hs.exon.2=> explain select * from ga_psr_transcript t, ga_psr_exon e
> > where e.parent = t.id and t.id = 123;
>
> ISTM it's willing to use an index scan on at least some of t's
> subtables.
> Does explicitly saying e.parent=123 help?

Yes, adding e.parent=123 results in the desired result of index scans
into both tables. However, without including this the optimizer still
predicts 31 results from the index scans on ga_psr_transcript* and yet
insists on using a seq scan into each ga_psr_exon* table. It expects
to get 2065853 rows back from the ga_psr_exon* tables when in reality
it is more like 310 rows.

Thanks for the response.

-Alan

FWIW, we subclass by chromosome for performance reasons. We have
tables (at the chromosome level) with upwards of 6 million rows
against which we run a variety of data mining queries.


pgsql-general by date:

Previous
From: "Ian Harding"
Date:
Subject: COPY, but not everything...
Next
From: "scott.marlowe"
Date:
Subject: Re: Failure to install 7.3.3