Thread: Indexing UNIONs
Folks, I have two tables which are often browsed together through a UNION view, like: CREATE VIEW two_tables AS SELECT t1.id, t1.name, t1.abbreviation, t1.juris_id FROM t1 UNION ALL SELECT t2.id, t2.name, NULL, t2.juris_id FROM t2; This works fine as a view, since I have made the id's unique between the two tables (using a sequence). However, as t1 has 100,000 records, it is vitally important that queries against this view use an index. As it is a Union view, though, they ignore any indexes: jwnet=> explain select * from two_tables where id = 101072; NOTICE: QUERY PLAN: Subquery Scan two_tables (cost=0.00..3340.82 rows=99182 width=55) -> Append (cost=0.00..3340.82 rows=99182 width=55) -> Subquery Scan *SELECT* 1 (cost=0.00..3339.81 rows=99181 width=55) -> Seq Scan on t1 (cost=0.00..3339.81rows=99181 width=55) -> Subquery Scan *SELECT* 2 (cost=0.00..1.01 rows=1 width=28) -> Seq Scan on t2 (cost=0.00..1.01 rows=1 width=28) EXPLAIN jwnet=> explain select * from t1 where id = 101072; NOTICE: QUERY PLAN: Index Scan using t1_pkey on cases (cost=0.00..5.99 rows=1 width=150) How can I make this happen? Ideas, suggestions? And no, putting the data from both tables into one is not an option for various schema reasons. -- -Josh BerkusAglio Database SolutionsSan Francisco
On Mon, 15 Jul 2002, Josh Berkus wrote: > Folks, > > I have two tables which are often browsed together through a UNION view, like: > > CREATE VIEW two_tables AS > SELECT t1.id, t1.name, t1.abbreviation, t1.juris_id > FROM t1 > UNION ALL > SELECT t2.id, t2.name, NULL, t2.juris_id > FROM t2; > > This works fine as a view, since I have made the id's unique between the two > tables (using a sequence). However, as t1 has 100,000 records, it is > vitally important that queries against this view use an index. We had a discussion recently on -general about this. Right now the planner won't push the conditions down into the arms of the union because noone's been sure under what conditions the optimization is safe.
Stephan, > We had a discussion recently on -general about this. Right now the > planner won't push the conditions down into the arms of the union because > noone's been sure under what conditions the optimization is safe. So, if performance is horrible with the view, I should use a dummy table to hold the Unioned data and index that instead? I can understand the difficultyof optimization. However, the example I supplied is the simplest of unions, and the two Seq Scans do seem to be proceeding against each table seperately. I think for very simple Unions (i.e. no grouping, no filtering within subqueries, etc.) that index usage would be reasonable to implement. However, I can't program it myself, so I'll have to just stick to whining and pitiful pleading <blink puppy-dog eyes, sniffle> -- -Josh BerkusAglio Database SolutionsSan Francisco
On Mon, 15 Jul 2002, Josh Berkus wrote: > Stephan, > > > We had a discussion recently on -general about this. Right now the > > planner won't push the conditions down into the arms of the union because > > noone's been sure under what conditions the optimization is safe. > > So, if performance is horrible with the view, I should use a dummy table to > hold the Unioned data and index that instead? Possibly. > I can understand the difficultyof optimization. However, the example I > supplied is the simplest of unions, and the two Seq Scans do seem to be > proceeding against each table seperately. I think for very simple Unions > (i.e. no grouping, no filtering within subqueries, etc.) that index usage > would be reasonable to implement. I don't think it's a difficulty of implementation thing. I'd guess that alot of the current stuff for shoving down conditions would apply (I haven't looked, though). It's more a case of making sure the optimization cannot be a false one that changes the results. What we need is someone to sit down and analyze the cases in a serious way. I think that for union all, conditions other than non-stable ones can be pushed down. For union, I think there might be issues due to the removal of duplicates in certain cases where the results will change, but that the results may not be deterministic in such cases anyway (like a case where two values are not exactly the same but aren't distinct due to collation or some such and so the system picks an arbitrary one and that arbitrary one affects the output of query). I have no good idea for EXCEPT and INTERSECT.
On Mon, Jul 15, 2002 at 17:31:24 -0700, Josh Berkus <josh@agliodbs.com> wrote: > Stephan, > > > We had a discussion recently on -general about this. Right now the > > planner won't push the conditions down into the arms of the union because > > noone's been sure under what conditions the optimization is safe. > > So, if performance is horrible with the view, I should use a dummy table to > hold the Unioned data and index that instead? It wouldn't have to be a dummy table. You could have both sets of data in the same table. Since they seem to be related enough that you went to the trouble to give them compatible primary keys this may not be inappropiate (though you must have had some reason for keeping them separate). You can use a flag to indicate what the data type is. If you need fast access to the smaller part of the table, a partial index might work. If the column that didn't apply to the one table is always not null in the other table, you could use is null on that column as your flag.
Bruno, > It wouldn't have to be a dummy table. You could have both sets of > data > in the same table. Per my original e-mail, this is not an option. Basically, the two tables have nothing in commmon *except* that events can be scheduled against either table. Otherwise, the two tables have vastly different data, which comes from completely different sources, and is related to a totally different set of dependant tables. So, no go. I run into this sort of thing a lot. Is it just the way I design databases, or is there a need for a more sophisticated model of relationality for SQL03? -Josh
On Tue, Jul 16, 2002 at 09:36:31 -0700, Josh Berkus <josh@agliodbs.com> wrote: > Bruno, > > > It wouldn't have to be a dummy table. You could have both sets of > > data > > in the same table. > > Per my original e-mail, this is not an option. > > Basically, the two tables have nothing in commmon *except* that events > can be scheduled against either table. Otherwise, the two tables have > vastly different data, which comes from completely different sources, > and is related to a totally different set of dependant tables. > > So, no go. > > I run into this sort of thing a lot. Is it just the way I design > databases, or is there a need for a more sophisticated model of > relationality for SQL03? This sounds like a design issue. This makes it seem like the events should be broken out into their own table and the other two tables should get joined with the events table when needed.
Bruno, > This sounds like a design issue. This makes it seem like the events > should be broken out into their own table and the other two tables > should get joined with the events table when needed. > OK, I guess I'll have to get into detail: Table "cases" is the database's third largest table, with 100,000 records, plus three dependant tables and 19 attributes (fields). Table "trial groups" is a small table listing a few dozen "cases" which are aggregated for settlement bargaining. Thus, each "trial group" relates to one to many "cases". Beyond this relationship, trial groups has only 5 attributes and 2 dependant tables. Table "events", the largest table in the database, contains event schedule listing with 11 attributes and one dependant table as well as recursive relationships between events. Each event record can be (and Must be) related to either one Case or one Trial Group. Thus, I need to relate (in views and queries) each Event to the Union of Cases and Trial Groups. I just can't figure out how to do so without the database discarding the indexes on Cases in the process and things slowing to a crawl. -- -Josh BerkusAglio Database SolutionsSan Francisco
On Tuesday 16 Jul 2002 11:42 pm, Josh Berkus wrote: > OK, I guess I'll have to get into detail: > [detail on cases and trial-groups] > > Thus, I need to relate (in views and queries) each Event to the Union of > Cases and Trial Groups. I just can't figure out how to do so without the > database discarding the indexes on Cases in the process and things slowing > to a crawl. Well, if there was some commonality between cases and trial groups you'd have noticed it. How about two event tables, one for each type of schedulable event and unioning those? Of course, that's just shuffling the complexity around since you'll need a view with the relevant rewrites and possibly some way of detecting scheduling conflicts? - Richard Huxton
On Tue, Jul 16, 2002 at 15:42:23 -0700, Josh Berkus <josh@agliodbs.com> wrote: > > Table "events", the largest table in the database, contains event schedule > listing with 11 attributes and one dependant table as well as recursive > relationships between events. Each event record can be (and Must be) > related to either one Case or one Trial Group. > > Thus, I need to relate (in views and queries) each Event to the Union of Cases > and Trial Groups. I just can't figure out how to do so without the database > discarding the indexes on Cases in the process and things slowing to a crawl. I think you might be able to do this using (one sided) outer joins of the event table to the Case and Trial Group tables. The join rules will need to work for exactly one of the two tables. You probably will want to use case statements in the select list to pick values from the right table.
Bruno, > I think you might be able to do this using (one sided) outer joins of the event > table to the Case and Trial Group tables. The join rules will need to work for > exactly one of the two tables. You probably will want to use case statements in > the select list to pick values from the right table. Yeah, I'm doing that in some places. It just doesn't work for all queries. COALESCE() is my friend. -- -Josh BerkusAglio Database SolutionsSan Francisco
Just in case there was some misunderstanding of my suggestion here is what I had in mind. Your query: SELECT t1.id, t1.name, t1.abbreviation, t1.juris_id FROM t1 UNION ALL SELECT t2.id, t2.name, NULL, t2.juris_id FROM t2; My suggestion: SELECT t3.id, coalesce(t1.name, t2.name), t1.abbreviation, coalesce(t1.juris_id, t2.juris_id) from (t3 left join t1 using(id)) left join t2 using (id); t3 is the event table. This will result in one row for each row in t3 (since id is unique accross t1 and t2). It will contain the name, juris_id and abbreviation from whichever table matched. I expect the query to be able to make use of indexes in this form, though I haven;t tested it to make sure.
Bruno, > My suggestion: > SELECT t3.id, coalesce(t1.name, t2.name), t1.abbreviation, > coalesce(t1.juris_id, t2.juris_id) from > (t3 left join t1 using (id)) left join t2 using (id); Cool! I didn't think of that. I'll give it a try. -Josh