Thread: Inheritance & Indexes
Hi, I'm trying to understand an issue we've encountered when using inheritance. We've observed this behavior with both 7.2 and 7.3. The crux of the issue is that indexes are not getting used when one has a join on the parent table. Say I have two tables: hs.exon.2=> \d ga_psr_transcript Table "public.ga_psr_transcript" Column | Type | Modifiers --------------+------------------------+----------- id | integer | not null parent | integer | seqname | character varying(100) | not null source_type | smallint | not null feature_type | smallint | not null start | integer | not null stop | integer | not null strand | character(1) | not null annot_name | character varying(100) | not null depth | integer | not null hs.exon.2=> \d ga_psr_exon Table "public.ga_psr_exon" Column | Type | Modifiers -----------------------+------------------------+----------- id | integer | not null parent | integer | seqname | character varying(100) | not null source_type | smallint | not null feature_type | smallint | not null start | integer | not null stop | integer | not null strand | character(1) | not null annot_name | character varying(100) | not null transcript_cluster_id | integer | not null depth | integer | not null Both of these tables inherit from a base table ga_annot. Each of these tables are further subclassed by chromosome (these tables hold genome annotation information with several 10's thousands to several 100's thousands of rows in each of the chromosome level tables). Tuples are only loaded into the chromosome level tables. So we have: /--- ga_psr_transcript_1 /---- ga_psr_transcript_2 /---- ga_psr_transcript <----- ga_psr_transcript_3 / \---- ... ga_annot < \ /--- ga_psr_transcript_1 \ /---- ga_psr_transcript_2 \--- ga_psr_exon ----- <----- ga_psr_transcript_3 \---- ... ^ All Tuples Loaded into here There is also a foreign key relationship where the parent of ga_psr_exon_N references id in ga_psr_transcript_N. An example of "leaf" tables: hs.exon.2=> \d ga_psr_transcript_1 Table "public.ga_psr_transcript_1" Column | Type | Modifiers --------------+------------------------+----------- id | integer | not null parent | integer | seqname | character varying(100) | not null source_type | smallint | not null feature_type | smallint | not null start | integer | not null stop | integer | not null strand | character(1) | not null annot_name | character varying(100) | not null depth | integer | not null Indexes: ga_psr_transcript_1_pkey primary key btree (id), ga_psr_transcript_1_start_stop btree ("start", stop), ga_psr_transcript_1_stop btree (stop) Check constraints: "aw_psr_transcript_1_strand" (((strand = '+'::bpchar) OR (strand = '-'::bpchar)) OR (strand = '.'::bpchar)) Triggers: RI_ConstraintTrigger_1412526244, RI_ConstraintTrigger_1412526245 hs.exon.2=> \d ga_psr_exon_1 Table "public.ga_psr_exon_1" Column | Type | Modifiers -----------------------+------------------------+----------- id | integer | not null parent | integer | seqname | character varying(100) | not null source_type | smallint | not null feature_type | smallint | not null start | integer | not null stop | integer | not null strand | character(1) | not null annot_name | character varying(100) | not null transcript_cluster_id | integer | not null depth | integer | not null Indexes: ga_psr_exon_1_pkey primary key btree (id), ga_psr_exon_1_parent btree (parent), ga_psr_exon_1_start_stop btree ("start", stop), ga_psr_exon_1_stop btree (stop) Check constraints: "aw_psr_exon_1_strand" (((strand = '+'::bpchar) OR (strand = '-'::bpchar)) OR (strand = '.'::bpchar)) Triggers: RI_ConstraintTrigger_1412526088, RI_ConstraintTrigger_1412526089 hs.exon.2=> select count(*) from ga_psr_transcript_1; count ------- 43398 (1 row) hs.exon.2=> select count(*) from ga_psr_exon_1; count -------- 176908 (1 row) Now if I do a join on the "leaf" tables everything looks good: hs.exon.2=> explain select * from ga_psr_transcript_1 t, ga_psr_exon_1e where e.parent = t.id; QUERY PLAN -------------------------------------------------------------------------------------------------------------------- Merge Join (cost=0.00..9087.71 rows=176908 width=98) Merge Cond: ("outer".id = "inner".parent) -> Index Scan using ga_psr_transcript_1_pkey on ga_psr_transcript_1 t (cost=0.00..1066.17 rows=43398 width=47) -> Index Scan using ga_psr_exon_1_parent on ga_psr_exon_1 e (cost=0.00..5259.52 rows=176908 width=51) (4 rows) If I do a join on the parent table, the optimizer refuses to use the indicies: hs.exon.2=> explain select * from ga_psr_transcript t, ga_psr_exon e where e.parent = t.id; QUERY PLAN ----------------------------------------------------------------------------------------------- Merge Join (cost=1239155.37..70188119.40 rows=5514877218 width=334) Merge Cond: ("outer".id = "inner".parent) -> Sort (cost=243481.37..244816.14 rows=533908 width=165) Sort Key: t.id -> Append (cost=0.00..10980.08 rows=533908 width=165) -> Seq Scan on ga_psr_transcript t (cost=0.00..20.00rows=1000 width=165) -> Seq Scan on ga_psr_transcript_1 t (cost=0.00..881.98rows=43398 width=47) -> Seq Scan on ga_psr_transcript_2 t (cost=0.00..905.44rows=44544 width=47) -> Seq Scan on ga_psr_transcript_3 t (cost=0.00..694.82rows=34182 width=47) -> Seq Scan on ga_psr_transcript_4 t (cost=0.00..617.48rows=30348 width=47) -> Seq Scan on ga_psr_transcript_5 t (cost=0.00..653.92rows=32192 width=47) -> Seq Scan on ga_psr_transcript_6 t (cost=0.00..653.88rows=32088 width=47) -> Seq Scan on ga_psr_transcript_7 t (cost=0.00..570.88rows=28088 width=47) -> Seq Scan on ga_psr_transcript_8 t (cost=0.00..537.06rows=26406 width=47) -> Seq Scan on ga_psr_transcript_9 t (cost=0.00..460.06rows=22606 width=47) -> Seq Scan on ga_psr_transcript_10 t(cost=0.00..543.20 rows=26020 width=48) -> Seq Scan on ga_psr_transcript_11 t(cost=0.00..542.82 rows=25982 width=48) -> Seq Scan on ga_psr_transcript_12 t(cost=0.00..490.66 rows=23466 width=48) -> Seq Scan on ga_psr_transcript_13 t(cost=0.00..346.64 rows=16564 width=48) -> Seq Scan on ga_psr_transcript_14 t(cost=0.00..345.68 rows=16568 width=47) -> Seq Scan on ga_psr_transcript_15 t(cost=0.00..358.46 rows=17146 width=48) -> Seq Scan on ga_psr_transcript_16 t(cost=0.00..352.76 rows=16876 width=48) -> Seq Scan on ga_psr_transcript_17 t(cost=0.00..375.72 rows=17972 width=48) -> Seq Scan on ga_psr_transcript_18 t(cost=0.00..295.60 rows=14160 width=48) -> Seq Scan on ga_psr_transcript_19 t(cost=0.00..262.24 rows=12524 width=48) -> Seq Scan on ga_psr_transcript_20 t(cost=0.00..268.34 rows=12834 width=47) -> Seq Scan on ga_psr_transcript_21 t(cost=0.00..145.18 rows=6918 width=47) -> Seq Scan on ga_psr_transcript_22 t(cost=0.00..157.54 rows=7554 width=46) -> Seq Scan on ga_psr_transcript_x t (cost=0.00..431.08rows=21208 width=47) -> Seq Scan on ga_psr_transcript_y t (cost=0.00..60.40rows=2940 width=47) -> Seq Scan on ga_psr_transcript_un t (cost=0.00..7.14rows=314 width=55) -> Seq Scan on ga_psr_transcript_m t (cost=0.00..1.10rows=10 width=47) -> Sort (cost=995674.00..1000838.64 rows=2065853 width=169) Sort Key: e.parent -> Append (cost=0.00..43563.52 rows=2065853 width=169) -> Seq Scan on ga_psr_exon e (cost=0.00..0.00 rows=1width=169) -> Seq Scan on ga_psr_exon_1 e (cost=0.00..3691.08rows=176908 width=51) -> Seq Scan on ga_psr_exon_2 e (cost=0.00..3565.00rows=170800 width=51) -> Seq Scan on ga_psr_exon_3 e (cost=0.00..2709.88rows=129788 width=51) -> Seq Scan on ga_psr_exon_4 e (cost=0.00..2206.62rows=105662 width=51) -> Seq Scan on ga_psr_exon_5 e (cost=0.00..2376.48rows=113848 width=51) -> Seq Scan on ga_psr_exon_6 e (cost=0.00..2559.56rows=122156 width=51) -> Seq Scan on ga_psr_exon_7 e (cost=0.00..2352.78rows=112778 width=51) -> Seq Scan on ga_psr_exon_8 e (cost=0.00..1981.14rows=94914 width=51) -> Seq Scan on ga_psr_exon_9 e (cost=0.00..1807.76rows=86576 width=51) -> Seq Scan on ga_psr_exon_10 e (cost=0.00..2144.76rows=100376 width=52) -> Seq Scan on ga_psr_exon_11 e (cost=0.00..2158.90rows=100990 width=52) -> Seq Scan on ga_psr_exon_12 e (cost=0.00..2035.66rows=95266 width=52) -> Seq Scan on ga_psr_exon_13 e (cost=0.00..1212.48rows=56748 width=52) -> Seq Scan on ga_psr_exon_14 e (cost=0.00..1380.52rows=64652 width=51) -> Seq Scan on ga_psr_exon_15 e (cost=0.00..1529.88rows=71588 width=52) -> Seq Scan on ga_psr_exon_16 e (cost=0.00..1575.04rows=73704 width=52) -> Seq Scan on ga_psr_exon_17 e (cost=0.00..1811.04rows=84704 width=52) -> Seq Scan on ga_psr_exon_18 e (cost=0.00..1033.48rows=48348 width=52) -> Seq Scan on ga_psr_exon_19 e (cost=0.00..1375.44rows=64344 width=52) -> Seq Scan on ga_psr_exon_20 e (cost=0.00..1037.50rows=48550 width=51) -> Seq Scan on ga_psr_exon_21 e (cost=0.00..537.40rows=25140 width=51) -> Seq Scan on ga_psr_exon_22 e (cost=0.00..736.96rows=34596 width=50) -> Seq Scan on ga_psr_exon_x e (cost=0.00..1512.30rows=72430 width=51) -> Seq Scan on ga_psr_exon_y e (cost=0.00..205.00rows=9800 width=51) -> Seq Scan on ga_psr_exon_un e (cost=0.00..25.58rows=1158 width=59) -> Seq Scan on ga_psr_exon_m e (cost=0.00..1.28 rows=28 width=51) (62 rows) Same thing even if I'm querying for a specific tuple: hs.exon.2=> explain select * from ga_psr_transcript t, ga_psr_exon e where e.parent = t.id and t.id = 123; QUERY PLAN ------------------------------------------------------------------------ --------------------------------------------------- Hash Join (cost=99.06..73488.33 rows=320207 width=334) Hash Cond: ("outer".parent = "inner".id) -> Append (cost=0.00..43563.52 rows=2065853 width=169) -> Seq Scan on ga_psr_exon e (cost=0.00..0.00 rows=1width=169) -> Seq Scan on ga_psr_exon_1 e (cost=0.00..3691.08rows=176908 width=51) -> Seq Scan on ga_psr_exon_2 e (cost=0.00..3565.00rows=170800 width=51) -> Seq Scan on ga_psr_exon_3 e (cost=0.00..2709.88rows=129788 width=51) -> Seq Scan on ga_psr_exon_4 e (cost=0.00..2206.62rows=105662 width=51) -> Seq Scan on ga_psr_exon_5 e (cost=0.00..2376.48rows=113848 width=51) -> Seq Scan on ga_psr_exon_6 e (cost=0.00..2559.56rows=122156 width=51) -> Seq Scan on ga_psr_exon_7 e (cost=0.00..2352.78rows=112778 width=51) -> Seq Scan on ga_psr_exon_8 e (cost=0.00..1981.14 rows=94914 width=51) -> Seq Scan on ga_psr_exon_9 e (cost=0.00..1807.76 rows=86576 width=51) -> Seq Scan on ga_psr_exon_10 e (cost=0.00..2144.76rows=100376 width=52) -> Seq Scan on ga_psr_exon_11 e (cost=0.00..2158.90rows=100990 width=52) -> Seq Scan on ga_psr_exon_12 e (cost=0.00..2035.66rows=95266 width=52) -> Seq Scan on ga_psr_exon_13 e (cost=0.00..1212.48rows=56748 width=52) -> Seq Scan on ga_psr_exon_14 e (cost=0.00..1380.52rows=64652 width=51) -> Seq Scan on ga_psr_exon_15 e (cost=0.00..1529.88rows=71588 width=52) -> Seq Scan on ga_psr_exon_16 e (cost=0.00..1575.04rows=73704 width=52) -> Seq Scan on ga_psr_exon_17 e (cost=0.00..1811.04rows=84704 width=52) -> Seq Scan on ga_psr_exon_18 e (cost=0.00..1033.48rows=48348 width=52) -> Seq Scan on ga_psr_exon_19 e (cost=0.00..1375.44rows=64344 width=52) -> Seq Scan on ga_psr_exon_20 e (cost=0.00..1037.50rows=48550 width=51) -> Seq Scan on ga_psr_exon_21 e (cost=0.00..537.40 rows=25140 width=51) -> Seq Scan on ga_psr_exon_22 e (cost=0.00..736.96 rows=34596 width=50) -> Seq Scan on ga_psr_exon_x e (cost=0.00..1512.30 rows=72430 width=51) -> Seq Scan on ga_psr_exon_y e (cost=0.00..205.00 rows=9800 width=51) -> Seq Scan on ga_psr_exon_un e (cost=0.00..25.58 rows=1158 width=59) -> Seq Scan on ga_psr_exon_m e (cost=0.00..1.28 rows=28 width=51) -> Hash (cost=98.98..98.98 rows=31 width=165) -> Append (cost=0.00..98.98 rows=31 width=165) -> Seq Scan on ga_psr_transcript t (cost=0.00..22.50 rows=5 width=165) Filter: (id = 123) -> Index Scan using ga_psr_transcript_1_pkey on ga_psr_transcript_1 t (cost=0.00..3.03 rows=1 width=47) Index Cond: (id = 123) -> Index Scan using ga_psr_transcript_2_pkey on ga_psr_transcript_2 t (cost=0.00..3.01 rows=1 width=47) Index Cond: (id = 123) -> Index Scan using ga_psr_transcript_3_pkey on ga_psr_transcript_3 t (cost=0.00..3.01 rows=1 width=47) Index Cond: (id = 123) -> Index Scan using ga_psr_transcript_4_pkey on ga_psr_transcript_4 t (cost=0.00..3.01 rows=1 width=47) Index Cond: (id = 123) -> Index Scan using ga_psr_transcript_5_pkey on ga_psr_transcript_5 t (cost=0.00..3.01 rows=1 width=47) Index Cond: (id = 123) -> Index Scan using ga_psr_transcript_6_pkey on ga_psr_transcript_6 t (cost=0.00..3.01 rows=1 width=47) Index Cond: (id = 123) -> Index Scan using ga_psr_transcript_7_pkey on ga_psr_transcript_7 t (cost=0.00..3.01 rows=1 width=47) Index Cond: (id = 123) -> Index Scan using ga_psr_transcript_8_pkey on ga_psr_transcript_8 t (cost=0.00..3.01 rows=1 width=47) Index Cond: (id = 123) -> Index Scan using ga_psr_transcript_9_pkey on ga_psr_transcript_9 t (cost=0.00..3.01 rows=1 width=47) Index Cond: (id = 123) -> Index Scan using ga_psr_transcript_10_pkey on ga_psr_transcript_10 t (cost=0.00..3.01 rows=1 width=48) Index Cond: (id = 123) -> Index Scan using ga_psr_transcript_11_pkey on ga_psr_transcript_11 t (cost=0.00..3.01 rows=1 width=48) Index Cond: (id = 123) -> Index Scan using ga_psr_transcript_12_pkey on ga_psr_transcript_12 t (cost=0.00..3.01 rows=1 width=48) Index Cond: (id = 123) -> Index Scan using ga_psr_transcript_13_pkey on ga_psr_transcript_13 t (cost=0.00..3.01 rows=1 width=48) Index Cond: (id = 123) -> Index Scan using ga_psr_transcript_14_pkey on ga_psr_transcript_14 t (cost=0.00..3.01 rows=1 width=47) Index Cond: (id = 123) -> Index Scan using ga_psr_transcript_15_pkey on ga_psr_transcript_15 t (cost=0.00..3.01 rows=1 width=48) Index Cond: (id = 123) -> Index Scan using ga_psr_transcript_16_pkey on ga_psr_transcript_16 t (cost=0.00..3.01 rows=1 width=48) Index Cond: (id = 123) -> Index Scan using ga_psr_transcript_17_pkey on ga_psr_transcript_17 t (cost=0.00..3.01 rows=1 width=48) Index Cond: (id = 123) -> Index Scan using ga_psr_transcript_18_pkey on ga_psr_transcript_18 t (cost=0.00..3.01 rows=1 width=48) Index Cond: (id = 123) -> Index Scan using ga_psr_transcript_19_pkey on ga_psr_transcript_19 t (cost=0.00..3.01 rows=1 width=48) Index Cond: (id = 123) -> Index Scan using ga_psr_transcript_20_pkey on ga_psr_transcript_20 t (cost=0.00..3.01 rows=1 width=47) Index Cond: (id = 123) -> Index Scan using ga_psr_transcript_21_pkey on ga_psr_transcript_21 t (cost=0.00..3.01 rows=1 width=47) Index Cond: (id = 123) -> Index Scan using ga_psr_transcript_22_pkey on ga_psr_transcript_22 t (cost=0.00..3.01 rows=1 width=46) Index Cond: (id = 123) -> Index Scan using ga_psr_transcript_x_pkey on ga_psr_transcript_x t (cost=0.00..3.01 rows=1 width=47) Index Cond: (id = 123) -> Index Scan using ga_psr_transcript_y_pkey on ga_psr_transcript_y t (cost=0.00..3.01 rows=1 width=47) Index Cond: (id = 123) -> Index Scan using ga_psr_transcript_un_pkey on ga_psr_transcript_un t (cost=0.00..3.01 rows=1 width=55) Index Cond: (id = 123) -> Seq Scan on ga_psr_transcript_m t (cost=0.00..1.12 rows=1 width=47) Filter: (id = 123) (86 rows) Any comments or suggestions would be greatly appreciated. Thanks. --------------------------------------------------------- Alan Williams, PhD Alan_Williams@Affymetrix.com Bioinformatics Manager ---------------------------------------------------------
On Tue, 24 Jun 2003, Alan Williams wrote: > hs.exon.2=> \d ga_psr_transcript_1 > Table "public.ga_psr_transcript_1" > Column | Type | Modifiers > --------------+------------------------+----------- > id | integer | not null > parent | integer | > seqname | character varying(100) | not null > source_type | smallint | not null > feature_type | smallint | not null > start | integer | not null > stop | integer | not null > strand | character(1) | not null > annot_name | character varying(100) | not null > depth | integer | not null > Indexes: ga_psr_transcript_1_pkey primary key btree (id), > ga_psr_transcript_1_start_stop btree ("start", stop), > ga_psr_transcript_1_stop btree (stop) > Check constraints: "aw_psr_transcript_1_strand" (((strand = '+'::bpchar) OR (strand = '-'::bpchar)) OR (strand = '.'::bpchar)) > Triggers: RI_ConstraintTrigger_1412526244, > RI_ConstraintTrigger_1412526245 > > hs.exon.2=> \d ga_psr_exon_1 > Table "public.ga_psr_exon_1" > Column | Type | Modifiers > -----------------------+------------------------+----------- > id | integer | not null > parent | integer | > seqname | character varying(100) | not null > source_type | smallint | not null > feature_type | smallint | not null > start | integer | not null > stop | integer | not null > strand | character(1) | not null > annot_name | character varying(100) | not null > transcript_cluster_id | integer | not null > depth | integer | not null > Indexes: ga_psr_exon_1_pkey primary key btree (id), > ga_psr_exon_1_parent btree (parent), > ga_psr_exon_1_start_stop btree ("start", stop), > ga_psr_exon_1_stop btree (stop) > Check constraints: "aw_psr_exon_1_strand" (((strand = '+'::bpchar) OR (strand = '-'::bpchar)) OR (strand = '.'::bpchar)) > Triggers: RI_ConstraintTrigger_1412526088, > RI_ConstraintTrigger_1412526089 > > hs.exon.2=> select count(*) from ga_psr_transcript_1; > count > ------- > 43398 > (1 row) > > hs.exon.2=> select count(*) from ga_psr_exon_1; > count > -------- > 176908 > (1 row) > > Now if I do a join on the "leaf" tables everything looks good: > > hs.exon.2=> explain select * from ga_psr_transcript_1 t, ga_psr_exon_1e where e.parent = t.id; > QUERY PLAN > -------------------------------------------------------------------------------------------------------------------- > Merge Join (cost=0.00..9087.71 rows=176908 width=98) > Merge Cond: ("outer".id = "inner".parent) > -> Index Scan using ga_psr_transcript_1_pkey on ga_psr_transcript_1 t (cost=0.00..1066.17 rows=43398 width=47) > -> Index Scan using ga_psr_exon_1_parent on ga_psr_exon_1 e (cost=0.00..5259.52 rows=176908 width=51) > (4 rows) > > If I do a join on the parent table, the optimizer refuses to use the > indicies: > > hs.exon.2=> explain select * from ga_psr_transcript t, ga_psr_exon e where e.parent = t.id; In this case, you can't use a single index scan to get the rows in order so the part that makes the above a nice plan doesn't really apply. If you're getting all the rows and sorting them, index scans are probably a waste of time unless you have alot of dead space. If we supported multi-table indexes, that'd potentially let you get a plan like the above. > ----------------------------------------------------------------------------------------------- > Merge Join (cost=1239155.37..70188119.40 rows=5514877218 width=334) > Merge Cond: ("outer".id = "inner".parent) > -> Sort (cost=243481.37..244816.14 rows=533908 width=165) > Sort Key: t.id > -> Append (cost=0.00..10980.08 rows=533908 width=165) [lots of seqscans snipped] > -> Sort (cost=995674.00..1000838.64 rows=2065853 width=169) > Sort Key: e.parent > -> Append (cost=0.00..43563.52 rows=2065853 width=169) [more seqscans snipped] > Same thing even if I'm querying for a specific tuple: > > hs.exon.2=> explain select * from ga_psr_transcript t, ga_psr_exon e > where e.parent = t.id and t.id = 123; ISTM it's willing to use an index scan on at least some of t's subtables. Does explicitly saying e.parent=123 help? > QUERY PLAN > ------------------------------------------------------------------------ > --------------------------------------------------- > Hash Join (cost=99.06..73488.33 rows=320207 width=334) > Hash Cond: ("outer".parent = "inner".id) > -> Append (cost=0.00..43563.52 rows=2065853 width=169) [lots of seqscans snipped again] > -> Hash (cost=98.98..98.98 rows=31 width=165) > -> Append (cost=0.00..98.98 rows=31 width=165) > -> Seq Scan on ga_psr_transcript t (cost=0.00..22.50 rows=5 width=165) > Filter: (id = 123) [snipping a bunch of index scans] > -> Seq Scan on ga_psr_transcript_m t (cost=0.00..1.12 rows=1 width=47) > Filter: (id = 123) > (86 rows)
On Tue, 24 Jun 2003, Stephan Szabo wrote: > > hs.exon.2=> explain select * from ga_psr_transcript_1 t, > ga_psr_exon_1e where e.parent = t.id; > > QUERY PLAN > > > ------------------------------------------------------------------------ > -------------------------------------------- > > Merge Join (cost=0.00..9087.71 rows=176908 width=98) > > Merge Cond: ("outer".id = "inner".parent) > > -> Index Scan using ga_psr_transcript_1_pkey on > ga_psr_transcript_1 t (cost=0.00..1066.17 rows=43398 width=47) > > -> Index Scan using ga_psr_exon_1_parent on ga_psr_exon_1 e > (cost=0.00..5259.52 rows=176908 width=51) > > (4 rows) > > > > If I do a join on the parent table, the optimizer refuses to use the > > indicies: > > > > hs.exon.2=> explain select * from ga_psr_transcript t, ga_psr_exon e > where e.parent = t.id; > > In this case, you can't use a single index scan to get the rows in order > so the part that makes the above a nice plan doesn't really apply. If > you're getting all the rows and sorting them, index scans are probably a > waste of time unless you have alot of dead space. If we supported > multi-table indexes, that'd potentially let you get a plan like > the above. Because of the foreign key constraint, the database engine could do the above query on each of the child tables and concatenate the results. This is because there is a notion in our schema of "paired" inheritance where both the ga_psr_exon ahd ga_psr_transcript tables are subclassed by choromosome. IE: ga_psr_exon_1.parent --> ga_psr_transcript_1.id Of course the foreign key is the only indication of this and I can't say that I'm entirely surprised that the optimizer doesn't catch this. This constraint unfortunately breaks down when joining on a non-foreign key, such as range queries (the rows in these tables represent ranges in a one dimensional space) like: explain select * from ga_psr_exon_1 e1, ga_psr_exon_1 e2 where e1.start >= e2.start and e1.start<=e2.stop and e1.stop >= e2.start and e1.stop <= e2.stop; Nested Loop (cost=0.00..995313942.65 rows=386375808 width=102) -> Seq Scan on ga_psr_exon_1 e1 (cost=0.00..3691.08 rows=176908 width=51) -> Index Scan using ga_psr_exon_1_start_stop on ga_psr_exon_1 e2 (cost=0.00..5582.47 rows=2184 width=51) Index Cond: (("outer"."start" >= e2."start") AND ("outer".stop >= e2."start") AND ("outer"."start" <= e2.stop) AND("outer".stop <= e2.stop)) versus explain select * from ga_psr_exon e1, ga_psr_exon e2 where e1.start >= e2.start and e1.start<=e2.stop and e1.stop >= e2.start and e1.stop <= e2.stop; which results in a nested loop of seq scans. (It is actually worse than this as we would really like to query for overlapping ranges, not containment.) Currently we use either a perl middleware or UNIONs to explicitly force these paired table relationships. > > > > ------------------------------------------------------------------------ > ----------------------- > > Merge Join (cost=1239155.37..70188119.40 rows=5514877218 width=334) > > Merge Cond: ("outer".id = "inner".parent) > > -> Sort (cost=243481.37..244816.14 rows=533908 width=165) > > Sort Key: t.id > > -> Append (cost=0.00..10980.08 rows=533908 width=165) > [lots of seqscans snipped] > > -> Sort (cost=995674.00..1000838.64 rows=2065853 width=169) > > Sort Key: e.parent > > -> Append (cost=0.00..43563.52 rows=2065853 width=169) > [more seqscans snipped] > > > Same thing even if I'm querying for a specific tuple: > > > > hs.exon.2=> explain select * from ga_psr_transcript t, ga_psr_exon e > > where e.parent = t.id and t.id = 123; > > ISTM it's willing to use an index scan on at least some of t's > subtables. > Does explicitly saying e.parent=123 help? Yes, adding e.parent=123 results in the desired result of index scans into both tables. However, without including this the optimizer still predicts 31 results from the index scans on ga_psr_transcript* and yet insists on using a seq scan into each ga_psr_exon* table. It expects to get 2065853 rows back from the ga_psr_exon* tables when in reality it is more like 310 rows. Thanks for the response. -Alan FWIW, we subclass by chromosome for performance reasons. We have tables (at the chromosome level) with upwards of 6 million rows against which we run a variety of data mining queries.
On Tue, 24 Jun 2003, Alan Williams wrote: > > On Tue, 24 Jun 2003, Stephan Szabo wrote: > > > hs.exon.2=> explain select * from ga_psr_transcript_1 t, > > ga_psr_exon_1e where e.parent = t.id; > > > QUERY PLAN > > > > > ------------------------------------------------------------------------ > > -------------------------------------------- > > > Merge Join (cost=0.00..9087.71 rows=176908 width=98) > > > Merge Cond: ("outer".id = "inner".parent) > > > -> Index Scan using ga_psr_transcript_1_pkey on > > ga_psr_transcript_1 t (cost=0.00..1066.17 rows=43398 width=47) > > > -> Index Scan using ga_psr_exon_1_parent on ga_psr_exon_1 e > > (cost=0.00..5259.52 rows=176908 width=51) > > > (4 rows) > > > > > > If I do a join on the parent table, the optimizer refuses to use the > > > indicies: > > > > > > hs.exon.2=> explain select * from ga_psr_transcript t, ga_psr_exon e > > where e.parent = t.id; > > > > In this case, you can't use a single index scan to get the rows in order > > so the part that makes the above a nice plan doesn't really apply. If > > you're getting all the rows and sorting them, index scans are probably a > > waste of time unless you have alot of dead space. If we supported > > multi-table indexes, that'd potentially let you get a plan like > > the above. > > Because of the foreign key constraint, the database engine could do > the above query on each of the child tables and concatenate the > results. This is because there is a notion in our schema of "paired" I don't think it can do exon_1 -> transcript_1 union exon_2 -> transcript_2 etc from the above unless there's also a guarantee of uniqueness since if the same id showed up in transcript_1 and transcript_2 you'd have to join them both to a parent in exon_1. The individual id primary keys are not sufficient to show that though so you'd have to join exon_1 -> transcript_1 union exon_1 -> transcript_2 union exon_2 -> transcript_1... to guarantee the same results I think. I don't think that we're ever likely to figure out the optimization for those cases in any case. Multi-table indexes will probably be coming eventually which will allow a scan over that rather than the append step. > > ------------------------------------------------------------------------ > > ----------------------- > > > Merge Join (cost=1239155.37..70188119.40 rows=5514877218 width=334) > > > Merge Cond: ("outer".id = "inner".parent) > > > -> Sort (cost=243481.37..244816.14 rows=533908 width=165) > > > Sort Key: t.id > > > -> Append (cost=0.00..10980.08 rows=533908 width=165) > > [lots of seqscans snipped] > > > -> Sort (cost=995674.00..1000838.64 rows=2065853 width=169) > > > Sort Key: e.parent > > > -> Append (cost=0.00..43563.52 rows=2065853 width=169) > > [more seqscans snipped] > > > > > Same thing even if I'm querying for a specific tuple: > > > > > > hs.exon.2=> explain select * from ga_psr_transcript t, ga_psr_exon e > > > where e.parent = t.id and t.id = 123; > > > > ISTM it's willing to use an index scan on at least some of t's > > subtables. > > Does explicitly saying e.parent=123 help? > > Yes, adding e.parent=123 results in the desired result of index scans > into both tables. However, without including this the optimizer still > predicts 31 results from the index scans on ga_psr_transcript* and yet > insists on using a seq scan into each ga_psr_exon* table. It expects > to get 2065853 rows back from the ga_psr_exon* tables when in reality > it is more like 310 rows. Yeah, it's guessing the number of rows rather poorly. Without the implied search condition, the index scan wouldn't help barring a small estimated number of rows in t making nested loop look good (I assume the t estimate is way off too, does analyzing the various tables or possibly raising the analyze buckets for the id column and analyzing get that estimate to something reasonable?). I *think* 7.4 may be smarter about implying these conditions as well.
Stephan Szabo <sszabo@megazone23.bigpanda.com> writes: > I *think* 7.4 may be smarter about > implying these conditions as well. Not really. AFAIR the Append-style plan is the only thing you can get out of the planner for inheritance trees. This works well enough for restriction clauses like "id = constant" (since those get pushed down to the member tables, much as with UNION ALL), but it just isn't gonna be efficient for join situations. And I can't see any realistic way for the planner to realize that only some pairs of child tables need be joined. I think the decision to use an inheritance tree for performance reasons (as opposed to any logical necessity) was probably not a good one. It'd be better to store all the data in one big table and use partial indexes for faster access to subsets. regards, tom lane
On Wed, 25 Jun 2003, Tom Lane wrote: > Stephan Szabo <sszabo@megazone23.bigpanda.com> writes: > > I *think* 7.4 may be smarter about > > implying these conditions as well. > > Not really. AFAIR the Append-style plan is the only thing you can get > out of the planner for inheritance trees. This works well enough for > restriction clauses like "id = constant" (since those get pushed down to > the member tables, much as with UNION ALL), but it just isn't gonna be > efficient for join situations. And I can't see any realistic way for > the planner to realize that only some pairs of child tables need be > joined. I was actually thinking of the table1.col=table2.col and table1.col=42 implying table2.col=42 when I wrote the above because he was also wondering why it wasn't using index scans on the table2 tree. Which now that I have access to my 7.4 box again, it does appear to.