PG 8.2beta reordering working for this case? - Mailing list pgsql-sql

From Kyle Bateman
Subject PG 8.2beta reordering working for this case?
Date
Msg-id 45284DBC.3060901@actarg.com
Whole thread Raw
In response to Re: Using bitmap index scans-more efficient  (Kyle Bateman <kyle@actarg.com>)
Responses Re: PG 8.2beta reordering working for this case?
List pgsql-sql
Kyle Bateman wrote:

> Tom Lane wrote:
>
>>
>> Before 8.2 the optimizer has no ability to rearrange the order of outer
>> joins.  Do you have time to try your test case against CVS HEAD?
>>
>>
I've done some more refinement on my accounting ledger system that has
clarified some of the problems I was having with performance joining to
a union.  Here's an interesting test case.  I just tried it with PG
8.2beta1.  I don't claim to understand the new feature of reordering
queries very well, but it seems like this is related to that feature.
Since its still a performance problem in 8.2, I thought this might be a
helpful test case during beta.

To see this demo, run create.sql on a clean database.  It will create
all the needed tables, views and test data.

Then run q1 and q2 which are very efficient.  Then q3 is the slow one.
The reason is, it does the union, producing 300,000 records before
trying to select by project.  It seems like the optimizer should
internally rewrite the query to look more like what is in q4 (which is
very fast).

Is there a way to make the optimizer do this?

Kyle Bateman

-- Make test schema for demonstrating how the postgres optimizer might improve
-- performance on joins with unions

-- Contains a record for each project (job-costing) code
-- Projects are arranged in a hierarchical structure (parent/child)
-- --------------------------------------------------------------
create table proj (
    proj_id        int4 primary key,
    title        varchar,
    par            int4 references proj on update cascade
);
create index i_proj_par on proj (par);

-- Contains a record for every 2 combinations of projects which are related
-- to each other in the hierarchical project tree
-- (parent/child, ancestor/progenitor, etc.)
-- --------------------------------------------------------------
create table proj_rel (
    anst_id        int4 references proj on update cascade on delete cascade,    -- ancestor project number
    prog_id        int4 references proj on update cascade on delete cascade,    -- progenitor project number
    rel            int4,                                -- 0=self, 1=child, 2=grandchild, etc.
    primary key (anst_id, prog_id)
);

-- Contains a record for each account number and an optional alpha code to identify a sub-ledger
-- --------------------------------------------------------------
create table acct (
    acct_id        int4 primary key,    -- account number
    title        varchar,        -- name of the account
    code        varchar            -- alpha code for the account
);
create index i_acct_code on acct (code);

-- Each sub-ledger contains transactions unique to a certain part of the business
-- In addiiton to the standard fields, they all share in common, each sub-ledger
-- contains additional fields that are unique to it (so they can not all be
-- stored in a single table).  In our actual implementation, these sub-ledgers
-- are actually implemented as views joining even lower level tables.
-- --------------------------------------------------------------
create table subledg_A (
    rid            int4 primary key,        -- record ID
    amount        numeric(14,2),
    proj        int4 references proj on update cascade on delete cascade,
    unique_A        varchar                -- some other data
);
create index i_subledg_A_proj on subledg_A (proj);
-- --------------------------------------------------------------
create table subledg_B (
    rid            int4 primary key,        -- record ID
    amount        numeric(14,2),
    proj        int4 references proj on update cascade on delete cascade,
    unique_B        varchar                -- some other data
);
create index i_subledg_B_proj on subledg_B (proj);
-- --------------------------------------------------------------
create table subledg_C (
    rid            int4 primary key,        -- record ID
    amount        numeric(14,2),
    proj        int4 references proj on update cascade on delete cascade,
    unique_C        varchar                -- some other data
);
create index i_subledg_C_proj on subledg_C (proj);

-- These views allow a standard account code to presented in the appropriate ledgers
-- --------------------------------------------------------------
create view subview_A as select
   'AP ' || rid as trans_id,
    l.amount, l.proj,
    a.acct_id as acct
    from    subledg_A    l
        join    acct        a on a.code = 'ap';
-- --------------------------------------------------------------
create view subview_B as select
   'AR ' || rid as trans_id,
    l.amount, l.proj,
    a.acct_id as acct
    from    subledg_B    l
        join    acct        a on a.code = 'ar';
-- --------------------------------------------------------------
create view subview_C as select
   'PR ' || rid as trans_id,
    l.amount, l.proj,
    a.acct_id as acct
    from    subledg_C    l
        join    acct        a on a.code = 'pr';

-- General ledger - this should contain all transactions from all subledgers
-- --------------------------------------------------------------
create view gen_ledg as
    select trans_id, amount, proj, acct from subview_A
    union
    select trans_id, amount, proj, acct from subview_B
    union
    select trans_id, amount, proj, acct from subview_C;

-- Populate the project table:
insert into proj (proj_id,title,par) values ( 1, 'The main parent project', null);
insert into proj (proj_id,title,par) values ( 2, 'First  child of 1', 1);
insert into proj (proj_id,title,par) values ( 3, 'Second child of 1', 1);
insert into proj (proj_id,title,par) values ( 4, 'First  child of 2', 2);
insert into proj (proj_id,title,par) values ( 5, 'second child of 2', 2);
insert into proj (proj_id,title,par) values ( 6, 'First  child of 5', 5);
insert into proj (proj_id,title,par) values ( 7, 'Second child of 5', 5);
insert into proj (proj_id,title,par) values ( 8, 'Third  child of 5', 5);
insert into proj (proj_id,title,par) values ( 9, 'Fourth child of 5', 5);
insert into proj (proj_id,title,par) select *,'Sample project',1 from generate_series(10,5000);

-- Populate the project relationships table:
insert into proj_rel (anst_id,prog_id,rel) select proj_id, proj_id, 0 from proj;        -- self
insert into proj_rel (anst_id,prog_id,rel) select par, proj_id, 1 from proj where proj_id != 1;    -- parents
insert into proj_rel (anst_id,prog_id,rel) select 1, proj_id, 2 from proj where proj_id in (4,5);
insert into proj_rel (anst_id,prog_id,rel) select 1, proj_id, 3 from proj where proj_id in (6,7,8,9);
insert into proj_rel (anst_id,prog_id,rel) select 2, proj_id, 2 from proj where proj_id in (6,7,8,9);

-- Populate the account table:
insert into acct (acct_id,title,code) values ( 100, 'Account 100', 'cash');
insert into acct (acct_id,title,code) values ( 101, 'Account 101', null);
insert into acct (acct_id,title,code) values ( 102, 'Account 102', null);
insert into acct (acct_id,title,code) values ( 103, 'Account 103', 'ar');
insert into acct (acct_id,title,code) values ( 104, 'Account 104', null);
insert into acct (acct_id,title,code) values ( 105, 'Account 105', null);
insert into acct (acct_id,title,code) values ( 106, 'Account 106', 'ap');
insert into acct (acct_id,title,code) values ( 107, 'Account 107', null);
insert into acct (acct_id,title,code) values ( 108, 'Account 108', 'pr');
insert into acct (acct_id,title,code) values ( 109, 'Account 109', null);

insert into subledg_A (rid,amount,proj) select *, (random() * 10000), (random() * 4999) + 1 from
generate_series(1,100000);
insert into subledg_B (rid,amount,proj) select *, (random() * 10000), (random() * 4999) + 1 from
generate_series(1,100000);
insert into subledg_C (rid,amount,proj) select *, (random() * 10000), (random() * 4999) + 1 from
generate_series(1,100000);
vacuum analyze;
-- When querying the general ledger for all transactions belonging to project 5,
-- this first query is very fast.  In spite of the union, the optimizer seems to apply
-- the condition "proj = 5" to the inner tables first, and then append the results

--explain analyze
select * from gen_ledg where proj = 5;
-- It is also very fast when trying to find all transactions for the progeny of 5
-- projects (5,6,7,8,9), but only if you are only querying a sub-ledger directly:

--explain analyze
select lg.*
    from     subledg_A    lg
    join    proj_rel    pr    on pr.prog_id = lg.proj
    where    pr.anst_id = 5;
-- But when searching the general ledger for transactions belonging to the progeny
-- of project 5, the optimizer produces the union first.  It then has to process
-- 300,000 records to find the few it needs (even though the desired fields are
-- indexed in the underlying tables).

--explain analyze
select lg.*
    from     gen_ledg    lg
    join    proj_rel    pr    on pr.prog_id = lg.proj
    where    pr.anst_id = 5;
-- It would be nice if the optimizer could rewrite the query to work like this:
-- (which is very fast)

--explain analyze
select lg.*
    from     subledg_A    lg
    join    proj_rel    pr    on pr.prog_id = lg.proj
    where    pr.anst_id = 5

union select lg.*
    from     subledg_B    lg
    join    proj_rel    pr    on pr.prog_id = lg.proj
    where    pr.anst_id = 5

union select lg.*
    from     subledg_C    lg
    join    proj_rel    pr    on pr.prog_id = lg.proj
    where    pr.anst_id = 5
;

pgsql-sql by date:

Previous
From: Tom Lane
Date:
Subject: Re: Too much time to delete 19000 rows
Next
From: Tom Lane
Date:
Subject: Re: PG 8.2beta reordering working for this case?