Thread: Views With Unions
This is stepping back quite a while; let me point people to the thread of 2003-02 where Mariusz Czu\x{0142}ada <manieq@idea.net.pl> was looking for a way of optimizing a VIEW that was a UNION. <http://archives.postgresql.org/pgsql-performance/2003-02/msg00095.php> The subject has come up a few times through PostgreSQL history, and I'd imagine to think I may have a little something new to offer to it. Let's consider a table used to store log information: create table log_table ( request_time timestamp with time zone, object character varying, -- What they asked for request_type character(8), -- What they did to it request_size integer, requestor inet, request_status integer, result_size integer, request_detail character varying ); create index log_times on log_table(request_time); create index log_object on log_table(object); Every time "something happens," an entry goes into this table. Unfortunately, the table is likely to grow to tremendous size, over time, and there are all sorts of troublesome things about purging it: -> Fragmentation may waste space and destroy the usefulness of indices; -> Deleting data row by row will cause replication logic to go mad, as triggers get invoked for every single row modified; -> The action of deletion will draw the data we just decided was _useless_ into memory, injuring cache utilization badly as we fill the cache with trash. The obvious thought: Create several tables, and join them together into a view. So instead of log_table being a table, we have log_table_1 thru log_table_3, each with the schema describe above, and define the view: create view log_table as select * from log_table_1 union all select * from log_table_2 union all select * from log_table_3; It's easy enough (modulo a little debugging and pl/pgsql work :-)) to turn this into an updatable view so that inserts into log_table use a different log table every (day|week|month). And we can TRUNCATE the eldest one, which is a cheap operation. This approach also resembles the way the "O guys" handle partitioned tables, so it's not merely about "logs." Unfortunately, selects on the VIEW are, at present, unable to make use of the indices. So if we want all log entries for June 11th, the query: select * from log_table where request_time between 'june 11 2003' and 'june 12 2003'; returns a plan: Subquery Scan log_table (cost=0.00..10950.26 rows=177126 width=314) -> Append (cost=0.00..10950.26 rows=177126 width=314) -> Subquery Scan *SELECT* 1 (cost=0.00..3089.07 rows=50307 width=71) -> Seq Scan on log_table_1 (cost=0.00..3089.07 rows=50307 width=71) -> Subquery Scan *SELECT* 2 (cost=0.00..602.92 rows=9892 width=314) -> Seq Scan on log_table_2 (cost=0.00..602.92 rows=9892 width=314) -> Subquery Scan *SELECT* 3 (cost=0.00..2390.09 rows=39209 width=314) -> Seq Scan on log_table_3 (cost=0.00..2390.09 rows=39209 width=314) In effect, the query is materialized into: select * from (select * from log_table_1 union all select * from log_table_2 union all select * from log_table_3) as merger where [request_time between 'june 11 2003' and 'june 12 2003']; What would perform better would be to attach the WHERE clause to each of the union members. (Everyone stop and sing "Solidarity Forever" :-)) E.g.: select * from ( select * from log_table_1 where request_time between 'june 11 2003' and 'june 12 2003' union all select * from log_table_2 where request_time between 'june 11 2003' and 'june 12 2003' union all select * from log_table_3 where request_time between 'june 11 2003' and 'june 12 2003' union all ) as merged_version; Subquery Scan merged_version (cost=0.00..947.04 rows=247 width=314) (actual time=55.86..1776.42 rows=20124 loops=1) -> Append (cost=0.00..947.04 rows=247 width=314) (actual time=55.84..1483.60 rows=20124 loops=1) -> Subquery Scan *SELECT* 1 (cost=0.00..3.02 rows=1 width=71) (actual time=55.83..289.81 rows=3422 loops=1) -> Index Scan using log_table_1_trans_on_idx on log_table_1 (cost=0.00..3.02 rows=1 width=71) (actual time=55.80..239.84rows=3422 loops=1) -> Subquery Scan *SELECT* 2 (cost=0.00..191.38 rows=49 width=314) (actual time=62.32..1115.15 rows=16702 loops=1) -> Index Scan using log_table_2_trans_on_idx on log_table_2 (cost=0.00..191.38 rows=49 width=314) (actualtime=62.29..873.63 rows=16702 loops=1) -> Subquery Scan *SELECT* 3 (cost=0.00..752.64 rows=196 width=314) (actual time=26.69..26.69 rows=0 loops=1) -> Index Scan using log_table_3_trans_on_idx on log_table_3 (cost=0.00..752.64 rows=196 width=314) (actualtime=26.69..26.69 rows=0 loops=1) Total runtime: 1806.39 msec Which is nice and quick, as it cuts each set down to size _before_ merging them. Mariusz had been looking, back in February, for an optimization that would, in effect, throw away the UNION ALL clauses that were unnecessary. Tom Lane and Stephan Szabo, in discussing this, observed, quite rightly, that this is liable to be an obscure sort of optimization: Tom Lane writes: > Stephan Szabo <sszabo@megazone23.bigpanda.com> writes: > > Yeah, but I think what he's hoping is that it'll notice that > > "key=1 and key=3" would be noticed as a false condition so that it doesn't > > scan those tables since a row presumably can't satisify both. The question > > would be, is the expense of checking the condition for all queries > > greater than the potential gain for these sorts of queries. > Yes, this is the key point: we won't put in an optimization that > wins on a small class of queries unless there is no material cost > added for planning cases where it doesn't apply. In contrast, I would argue that adding the WHERE clause in as an extra condition on each of the UNION subqueries is an optimization that is likely to win in _most_ cases. It helps with the example I illustrated; it would help with Mariusz' scenario, not by outright eliminating UNION subqueries, but rather by making their result sets empty. select key, value from view123 where key = 2 transforms into... select key, value from tab1 where key=1 [and key = 2] union all select key, value from tab2 where key=2 [and key = 2] union all select key, value from tab3 where key=3 [and key = 2]; The generalization is that: select * from (select [fields1] from t1 where [cond1] (UNION|UNION ALL|INTERSECT) select [fields2] from t2 where [cond2] (UNION|UNION ALL|INTERSECT) ... select [fieldsn] from tn where [condn]) as COMBINATION WHERE [globalcond]; is equivalent to: select * from (select [fields1] from t1 where ([cond1]) and [globalcond] (UNION|UNION ALL|INTERSECT) select [fields2] from t2 where ([cond2]) and [globalcond] (UNION|UNION ALL|INTERSECT) ... select [fieldsn] from tn where ([condn]) and [globalcond] ) as COMBINATION; [globalcond] has to be expressed in terms of the fields available for each subquery, but that already needs to be true, because the global condition at present is being applied to the fields that are given by the UNION/INTERSECT/UNION ALL. -- let name="cbbrowne" and tld="libertyrms.info" in String.concat "@" [name;tld];; <http://dev6.int.libertyrms.com/> Christopher Browne (416) 646 3304 x124 (land)
On Thu, 31 Jul 2003, Christopher Browne wrote: > select * from log_table where request_time between 'june 11 2003' and > 'june 12 2003'; > > returns a plan: > Subquery Scan log_table (cost=0.00..10950.26 rows=177126 width=314) > -> Append (cost=0.00..10950.26 rows=177126 width=314) > -> Subquery Scan *SELECT* 1 (cost=0.00..3089.07 rows=50307 width=71) > -> Seq Scan on log_table_1 (cost=0.00..3089.07 rows=50307 width=71) > -> Subquery Scan *SELECT* 2 (cost=0.00..602.92 rows=9892 width=314) > -> Seq Scan on log_table_2 (cost=0.00..602.92 rows=9892 width=314) > -> Subquery Scan *SELECT* 3 (cost=0.00..2390.09 rows=39209 width=314) > -> Seq Scan on log_table_3 (cost=0.00..2390.09 rows=39209 width=314) What version are you using? In 7.3 and up it should be willing to consider moving the clause down, unless there's something like a type mismatch (because in that case it may not be equivalent without a bunch more work on the clause).
Stephan Szabo wrote: >On Thu, 31 Jul 2003, Christopher Browne wrote: > > > >> select * from log_table where request_time between 'june 11 2003' and >> 'june 12 2003'; >> >>returns a plan: >>Subquery Scan log_table (cost=0.00..10950.26 rows=177126 width=314) >> -> Append (cost=0.00..10950.26 rows=177126 width=314) >> -> Subquery Scan *SELECT* 1 (cost=0.00..3089.07 rows=50307 width=71) >> -> Seq Scan on log_table_1 (cost=0.00..3089.07 rows=50307 width=71) >> -> Subquery Scan *SELECT* 2 (cost=0.00..602.92 rows=9892 width=314) >> -> Seq Scan on log_table_2 (cost=0.00..602.92 rows=9892 width=314) >> -> Subquery Scan *SELECT* 3 (cost=0.00..2390.09 rows=39209 width=314) >> -> Seq Scan on log_table_3 (cost=0.00..2390.09 rows=39209 width=314) >> >> > >What version are you using? In 7.3 and up it should be willing to >consider moving the clause down, unless there's something like a type >mismatch (because in that case it may not be equivalent without a bunch >more work on the clause). > Dear Chris, I had the same problem(type mismatch) and it was solved finally. check the list "factoring problem ... " subject only 2 weeks back . regds mallah. > > > >---------------------------(end of broadcast)--------------------------- >TIP 9: the planner will ignore your desire to choose an index scan if your > joining column's datatypes do not match > >
Stephan Szabo said: > > On Thu, 31 Jul 2003, Christopher Browne wrote: > >> select * from log_table where request_time between 'june 11 2003' >> and >> 'june 12 2003'; >> >> returns a plan: >> Subquery Scan log_table (cost=0.00..10950.26 rows=177126 width=314) >> -> Append (cost=0.00..10950.26 rows=177126 width=314) >> -> Subquery Scan *SELECT* 1 (cost=0.00..3089.07 rows=50307 >> width=71) >> -> Seq Scan on log_table_1 (cost=0.00..3089.07 >> rows=50307 width=71) >> -> Subquery Scan *SELECT* 2 (cost=0.00..602.92 rows=9892 >> width=314) >> -> Seq Scan on log_table_2 (cost=0.00..602.92 >> rows=9892 width=314) >> -> Subquery Scan *SELECT* 3 (cost=0.00..2390.09 rows=39209 >> width=314) >> -> Seq Scan on log_table_3 (cost=0.00..2390.09 >> rows=39209 width=314) > > What version are you using? In 7.3 and up it should be willing to > consider moving the clause down, unless there's something like a type > mismatch (because in that case it may not be equivalent without a bunch > more work on the clause). That was 7.2.4, although I had also tried it on 7.4 (yesterday's CVS). Which provides four findings: 1. On 7.2.4, adding additional type info just doesn't help, fitting with the notion that, consistent with your comment, improvement wouldn't happen earlier than 7.3. There's no help on 7.2 :-(, and the system I'm initially most interested in using this on is still on 7.2. 2. When I retried on 7.4, it _did_ find search paths based on Index Scan, when I added in additional type information. So the optimization I was wishing for _is_ there :-). In the longer term, that's very good news. 3. I'll have to test this out on 7.3.4, now, as I hadn't, and it sounds as though that is an interesting case. 4. It's often necessary to expressly specify type information in queries to get the optimizer to do the Right Thing. -- (reverse (concatenate 'string "ofni.smrytrebil@" "enworbbc")) <http://dev6.int.libertyrms.info/> Christopher Browne (416) 646 3304 x124 (land)
On Fri, 1 Aug 2003, Christopher Browne wrote: > Stephan Szabo said: > > > > > > What version are you using? In 7.3 and up it should be willing to > > consider moving the clause down, unless there's something like a type > > mismatch (because in that case it may not be equivalent without a bunch > > more work on the clause). > > That was 7.2.4, although I had also tried it on 7.4 (yesterday's CVS). > > Which provides four findings: > > 1. On 7.2.4, adding additional type info just doesn't help, fitting with > the notion that, consistent with your comment, improvement wouldn't happen > earlier than 7.3. > > There's no help on 7.2 :-(, and the system I'm initially most interested > in using this on is still on 7.2. If you really wanted you could try going back and finding the diffs associated with this in the CVS history or committers archives and see if you can make equivalent changes to 7.2, but that's possibly going to be difficult. > 2. When I retried on 7.4, it _did_ find search paths based on Index Scan, > when I added in additional type information. So the optimization I was > wishing for _is_ there :-). In the longer term, that's very good news. > > 3. I'll have to test this out on 7.3.4, now, as I hadn't, and it sounds > as though that is an interesting case. > > 4. It's often necessary to expressly specify type information in queries > to get the optimizer to do the Right Thing. Especially for cases like this. It takes the safer route of not pushing things down when it's not sure if pushing down might change the semantics (for example if a union piece has a different type from the union output, simply pushing clauses down unchanged could change the results) Tom would probably be willing to relax conditions if it could be proven safe even for the wierd outlying cases with char and varchar and such.
Stephan Szabo <sszabo@megazone.bigpanda.com> writes: > On Fri, 1 Aug 2003, Christopher Browne wrote: >> Stephan Szabo said: >> > What version are you using? In 7.3 and up it should be willing to >> > consider moving the clause down, unless there's something like a type >> > mismatch (because in that case it may not be equivalent without a bunch >> > more work on the clause). >> >> That was 7.2.4, although I had also tried it on 7.4 (yesterday's CVS). >> >> Which provides four findings: >> >> 1. On 7.2.4, adding additional type info just doesn't help, fitting with >> the notion that, consistent with your comment, improvement wouldn't happen >> earlier than 7.3. >> >> There's no help on 7.2 :-(, and the system I'm initially most interested >> in using this on is still on 7.2. > > If you really wanted you could try going back and finding the diffs > associated with this in the CVS history or committers archives and see if > you can make equivalent changes to 7.2, but that's possibly going to be > difficult. Somehow I don't think that'll fly; I have taken a brief look at some of the optimizer code, and I very much don't want to leap into that at the moment. (I don't imagine I'd be able to muster much enthusiasm for the idea from others that are involved, either. More likely, I'm understating the probable opposition to the idea... :-)) I was hoping there would be some help on 7.2, but can live without it. This approach to improving log purgeability is NOT the sort of thing that you deploy on a day's notice because it seems like a "neat idea." If it waits a couple months to be implemented, that's doubtless OK. >> 2. When I retried on 7.4, it _did_ find search paths based on Index Scan, >> when I added in additional type information. So the optimization I was >> wishing for _is_ there :-). In the longer term, that's very good news. >> >> 3. I'll have to test this out on 7.3.4, now, as I hadn't, and it sounds >> as though that is an interesting case. It turns out nicely on 7.3.4, using index scans for the subqueries in the query: select count(*) from log_table where event_date between '2003-04-01' and '2003-05-01'; Which is a Good Thing. >> 4. It's often necessary to expressly specify type information in >> queries to get the optimizer to do the Right Thing. > > Especially for cases like this. It takes the safer route of not > pushing things down when it's not sure if pushing down might change > the semantics (for example if a union piece has a different type > from the union output, simply pushing clauses down unchanged could > change the results) > > Tom would probably be willing to relax conditions if it could be > proven safe even for the wierd outlying cases with char and varchar > and such. Evidently the dates of the form '2003-04-01' and such are getting types promoted properly enough. I don't see anything to "lobby" for at this point. The DOMAIN case I mentioned the other day had something odd going on that LOST the type information associated with the domain. Albeit that was on 7.3, whereas the changes in DOMAIN functionality that make them meaningfully useful come in 7.4... -- let name="cbbrowne" and tld="libertyrms.info" in name ^ "@" ^ tld;; <http://dev6.int.libertyrms.com/> Christopher Browne (416) 646 3304 x124 (land)
Christopher Browne <cbbrowne@libertyrms.info> writes: > The DOMAIN case I mentioned the other day had something odd going on > that LOST the type information associated with the domain. Albeit > that was on 7.3, whereas the changes in DOMAIN functionality that make > them meaningfully useful come in 7.4... Domains were a work-in-progress in 7.3, and to some extent still are. Please try to test out 7.4beta and let us know about deficiencies you find. regards, tom lane