PG 8.2beta reordering working for this case? - Mailing list pgsql-sql
From | Kyle Bateman |
---|---|
Subject | PG 8.2beta reordering working for this case? |
Date | |
Msg-id | 45284DBC.3060901@actarg.com Whole thread Raw |
In response to | Re: Using bitmap index scans-more efficient (Kyle Bateman <kyle@actarg.com>) |
Responses |
Re: PG 8.2beta reordering working for this case?
|
List | pgsql-sql |
Kyle Bateman wrote: > Tom Lane wrote: > >> >> Before 8.2 the optimizer has no ability to rearrange the order of outer >> joins. Do you have time to try your test case against CVS HEAD? >> >> I've done some more refinement on my accounting ledger system that has clarified some of the problems I was having with performance joining to a union. Here's an interesting test case. I just tried it with PG 8.2beta1. I don't claim to understand the new feature of reordering queries very well, but it seems like this is related to that feature. Since its still a performance problem in 8.2, I thought this might be a helpful test case during beta. To see this demo, run create.sql on a clean database. It will create all the needed tables, views and test data. Then run q1 and q2 which are very efficient. Then q3 is the slow one. The reason is, it does the union, producing 300,000 records before trying to select by project. It seems like the optimizer should internally rewrite the query to look more like what is in q4 (which is very fast). Is there a way to make the optimizer do this? Kyle Bateman -- Make test schema for demonstrating how the postgres optimizer might improve -- performance on joins with unions -- Contains a record for each project (job-costing) code -- Projects are arranged in a hierarchical structure (parent/child) -- -------------------------------------------------------------- create table proj ( proj_id int4 primary key, title varchar, par int4 references proj on update cascade ); create index i_proj_par on proj (par); -- Contains a record for every 2 combinations of projects which are related -- to each other in the hierarchical project tree -- (parent/child, ancestor/progenitor, etc.) -- -------------------------------------------------------------- create table proj_rel ( anst_id int4 references proj on update cascade on delete cascade, -- ancestor project number prog_id int4 references proj on update cascade on delete cascade, -- progenitor project number rel int4, -- 0=self, 1=child, 2=grandchild, etc. primary key (anst_id, prog_id) ); -- Contains a record for each account number and an optional alpha code to identify a sub-ledger -- -------------------------------------------------------------- create table acct ( acct_id int4 primary key, -- account number title varchar, -- name of the account code varchar -- alpha code for the account ); create index i_acct_code on acct (code); -- Each sub-ledger contains transactions unique to a certain part of the business -- In addiiton to the standard fields, they all share in common, each sub-ledger -- contains additional fields that are unique to it (so they can not all be -- stored in a single table). In our actual implementation, these sub-ledgers -- are actually implemented as views joining even lower level tables. -- -------------------------------------------------------------- create table subledg_A ( rid int4 primary key, -- record ID amount numeric(14,2), proj int4 references proj on update cascade on delete cascade, unique_A varchar -- some other data ); create index i_subledg_A_proj on subledg_A (proj); -- -------------------------------------------------------------- create table subledg_B ( rid int4 primary key, -- record ID amount numeric(14,2), proj int4 references proj on update cascade on delete cascade, unique_B varchar -- some other data ); create index i_subledg_B_proj on subledg_B (proj); -- -------------------------------------------------------------- create table subledg_C ( rid int4 primary key, -- record ID amount numeric(14,2), proj int4 references proj on update cascade on delete cascade, unique_C varchar -- some other data ); create index i_subledg_C_proj on subledg_C (proj); -- These views allow a standard account code to presented in the appropriate ledgers -- -------------------------------------------------------------- create view subview_A as select 'AP ' || rid as trans_id, l.amount, l.proj, a.acct_id as acct from subledg_A l join acct a on a.code = 'ap'; -- -------------------------------------------------------------- create view subview_B as select 'AR ' || rid as trans_id, l.amount, l.proj, a.acct_id as acct from subledg_B l join acct a on a.code = 'ar'; -- -------------------------------------------------------------- create view subview_C as select 'PR ' || rid as trans_id, l.amount, l.proj, a.acct_id as acct from subledg_C l join acct a on a.code = 'pr'; -- General ledger - this should contain all transactions from all subledgers -- -------------------------------------------------------------- create view gen_ledg as select trans_id, amount, proj, acct from subview_A union select trans_id, amount, proj, acct from subview_B union select trans_id, amount, proj, acct from subview_C; -- Populate the project table: insert into proj (proj_id,title,par) values ( 1, 'The main parent project', null); insert into proj (proj_id,title,par) values ( 2, 'First child of 1', 1); insert into proj (proj_id,title,par) values ( 3, 'Second child of 1', 1); insert into proj (proj_id,title,par) values ( 4, 'First child of 2', 2); insert into proj (proj_id,title,par) values ( 5, 'second child of 2', 2); insert into proj (proj_id,title,par) values ( 6, 'First child of 5', 5); insert into proj (proj_id,title,par) values ( 7, 'Second child of 5', 5); insert into proj (proj_id,title,par) values ( 8, 'Third child of 5', 5); insert into proj (proj_id,title,par) values ( 9, 'Fourth child of 5', 5); insert into proj (proj_id,title,par) select *,'Sample project',1 from generate_series(10,5000); -- Populate the project relationships table: insert into proj_rel (anst_id,prog_id,rel) select proj_id, proj_id, 0 from proj; -- self insert into proj_rel (anst_id,prog_id,rel) select par, proj_id, 1 from proj where proj_id != 1; -- parents insert into proj_rel (anst_id,prog_id,rel) select 1, proj_id, 2 from proj where proj_id in (4,5); insert into proj_rel (anst_id,prog_id,rel) select 1, proj_id, 3 from proj where proj_id in (6,7,8,9); insert into proj_rel (anst_id,prog_id,rel) select 2, proj_id, 2 from proj where proj_id in (6,7,8,9); -- Populate the account table: insert into acct (acct_id,title,code) values ( 100, 'Account 100', 'cash'); insert into acct (acct_id,title,code) values ( 101, 'Account 101', null); insert into acct (acct_id,title,code) values ( 102, 'Account 102', null); insert into acct (acct_id,title,code) values ( 103, 'Account 103', 'ar'); insert into acct (acct_id,title,code) values ( 104, 'Account 104', null); insert into acct (acct_id,title,code) values ( 105, 'Account 105', null); insert into acct (acct_id,title,code) values ( 106, 'Account 106', 'ap'); insert into acct (acct_id,title,code) values ( 107, 'Account 107', null); insert into acct (acct_id,title,code) values ( 108, 'Account 108', 'pr'); insert into acct (acct_id,title,code) values ( 109, 'Account 109', null); insert into subledg_A (rid,amount,proj) select *, (random() * 10000), (random() * 4999) + 1 from generate_series(1,100000); insert into subledg_B (rid,amount,proj) select *, (random() * 10000), (random() * 4999) + 1 from generate_series(1,100000); insert into subledg_C (rid,amount,proj) select *, (random() * 10000), (random() * 4999) + 1 from generate_series(1,100000); vacuum analyze; -- When querying the general ledger for all transactions belonging to project 5, -- this first query is very fast. In spite of the union, the optimizer seems to apply -- the condition "proj = 5" to the inner tables first, and then append the results --explain analyze select * from gen_ledg where proj = 5; -- It is also very fast when trying to find all transactions for the progeny of 5 -- projects (5,6,7,8,9), but only if you are only querying a sub-ledger directly: --explain analyze select lg.* from subledg_A lg join proj_rel pr on pr.prog_id = lg.proj where pr.anst_id = 5; -- But when searching the general ledger for transactions belonging to the progeny -- of project 5, the optimizer produces the union first. It then has to process -- 300,000 records to find the few it needs (even though the desired fields are -- indexed in the underlying tables). --explain analyze select lg.* from gen_ledg lg join proj_rel pr on pr.prog_id = lg.proj where pr.anst_id = 5; -- It would be nice if the optimizer could rewrite the query to work like this: -- (which is very fast) --explain analyze select lg.* from subledg_A lg join proj_rel pr on pr.prog_id = lg.proj where pr.anst_id = 5 union select lg.* from subledg_B lg join proj_rel pr on pr.prog_id = lg.proj where pr.anst_id = 5 union select lg.* from subledg_C lg join proj_rel pr on pr.prog_id = lg.proj where pr.anst_id = 5 ;