Thread: improving performance of UNION and ORDER BY

improving performance of UNION and ORDER BY

From
Chris Gamache
Date:
Three tables...

 Table "a_trans_log"
     Attribute     |           Type           | Modifier
-------------------+--------------------------+----------------------
 acctnum           | character varying(50)    |
 username          | character varying(50)    |
 completed         | timestamp with time zone |
 trans_date        | timestamp with time zone |
 id                | integer                  | not null default nextval
('a_id_seq'::text)
 v_val             | real                     |
 pgnum             | integer                  |
 trans_type        | character varying(50)    |
 trans_data        | character varying(50)    |
 user_reference_id | character varying(50)    |
 trans_charge      | money                    |
Indices: a_date_type_idx,
         a_trans_log_pkey,
         a_type_date_name_idx,
         a_username_idx

 Table "b_trans_log"
     Attribute     |           Type           | Modifier
-------------------+--------------------------+---------------------
 acctnum           | character varying(50)    |
 username          | character varying(50)    |
 completed         | timestamp with time zone |
 trans_date        | timestamp with time zone |
 id                | integer                  | not null default
nextval('b_id_seq'::text)
 trans_type        | character varying(50)    |
 trans_data        | character varying(50)    |
 user_reference_id | character varying(50)    |
 trans_charge      | money                    |
Indices: b_date_type_idx,
         b_trans_log_pkey,
         b_type_date_name_idx,
         b_username_idx

 Table "addtypelong"
 Attribute |         Type          |  Modifier
-----------+-----------------------+--------------------------
 id        | integer               | not null default
nextval('addtypelong_pkey_id'::text)
 shorttype | character varying(50) |
 longtype  | character varying(50) |
Index: addtypelong_pkey

The Query...

select
  a.username as "User",
  a.trans_date as "Date",
  tl.longtype as "Type",
  a.trans_data as "Query Data",
  a.trans_charge as "Charged",
  a.user_reference_id as "Reference ID"
from a_trans_log a, addtypelong tl
where (
  (tl.shorttype=a.trans_type) and
  (a.trans_date >= '12/31/01'::TIMESTAMP)
)
union
select
  b.username as "User",
  b.trans_date as "Date",
  tl.longtype as "Type",
  b.trans_data as "Query Data",
  b.trans_charge as "Charged",
  b.user_reference_id as "Reference ID"
from b_trans_log b, addtypelong tl
where (
  (tl.shorttype=b.trans_type) and
  (b.trans_date >= '12/31/01'::TIMESTAMP)
)
order by 2 desc, 4 limit 20;

The plan...


Limit  (cost=13349.87..13349.87 rows=20 width=84)
  ->  Sort  (cost=13349.87..13349.87 rows=2619 width=84)
        ->  Unique  (cost=12808.41..13201.20 rows=2619 width=84)
              ->  Sort  (cost=12808.41..12808.41 rows=26186 width=84)
                    ->  Append  (cost=1.20..10153.39 rows=26186 width=84)
                          ->  Subquery Scan *SELECT* 1  (cost=1.20..9674.89
rows=23724 width=84)
                                ->  Hash Join  (cost=1.20..9674.89 rows=23724
width=84)
                                      ->  Seq Scan on pubacs_trans_log a
(cost=0.00..8695.30 rows=24455 width=60)
                                      ->  Hash  (cost=1.16..1.16 rows=16
width=24)
                                            ->  Seq Scan on addtypelong tl
(cost=0.00..1.16 rows=16 width=24)
                          ->  Subquery Scan *SELECT* 2  (cost=1.20..478.50
rows=2462 width=84)
                                ->  Hash Join  (cost=1.20..478.50 rows=2462
width=84)
                                      ->  Seq Scan on mvr_trans_log b
(cost=0.00..378.61 rows=2462 width=60)
                                      ->  Hash  (cost=1.16..1.16 rows=16
width=24)
                                            ->  Seq Scan on addtypelong tl
(cost=0.00..1.16 rows=16 width=24)

EXPLAIN

I imagine the combination of UNION and ORDER BY causes the problem, since
Postgres has to locate all the rows that match the search criteria, merge them,
order them, then return the top 20...

Any suggestions? Did I forget to provide any data that would make things
clearer?

BTW: VACUUM ANALYZE is run nightly.

__________________________________________________
Do You Yahoo!?
Yahoo! Greetings - Send FREE e-cards for every occasion!
http://greetings.yahoo.com

Re: improving performance of UNION and ORDER BY

From
Jean-Luc Lachance
Date:
Chris,

I believe you can cut down the processing time by including the order by
and limit in each select:

(SELECT ... ORDER BY 2 DESC, 4 LIMIT 20)
union
(SELECT ... ORDER BY 2 DESC, 4 LIMIT 20)
ORDER BY 2 DESC, 4 LIMIT 20

Give it a try and let me know


JLL

P.S.

Question:  Why not have a single table with a field log_type in ( 'A',
'B') ???



Chris Gamache wrote:
>
> Three tables...
> [...]
>
> I imagine the combination of UNION and ORDER BY causes the problem, since
> Postgres has to locate all the rows that match the search criteria, merge them,
> order them, then return the top 20...
>
> Any suggestions? Did I forget to provide any data that would make things
> clearer?

Re: improving performance of UNION and ORDER BY

From
Chris Gamache
Date:
New Query...

(select
   a.username as "User",
   a.trans_date as "Date",
   tl.longtype as "Type",
   a.trans_data as "Query Data",
   a.trans_charge as "Charged",
   a.user_reference_id as "Reference ID"
 from a_trans_log a, addtypelong tl
 where (
   (tl.shorttype=a.trans_type) and
   (a.trans_date >= '12/31/01'::TIMESTAMP)
) order by 4 desc, 2 limit 20)
union
(select
   b.username as "User",
   b.trans_date as "Date",
   tl.longtype as "Type",
   b.trans_data as "Query Data",
   b.trans_charge as "Charged",
   b.user_reference_id as "Reference ID"
 from b_trans_log b, addtypelong tl
 where (
   (tl.shorttype=b.trans_type) and
   (b.trans_date >= '12/31/01'::TIMESTAMP)
) order by 4 desc, 2 limit 20)
order by 4 desc, 2 limit 20;

Limit  (cost=12674.47..12674.47 rows=4 width=84)
  ->  Sort  (cost=12674.47..12674.47 rows=4 width=84)
        ->  Unique  (cost=12673.83..12674.43 rows=4 width=84)
              ->  Sort  (cost=12673.83..12673.83 rows=40 width=84)
                    ->  Append  (cost=12055.58..12672.77 rows=40 width=84)
                          ->  Subquery Scan *SELECT* 1
(cost=12055.58..12055.58 rows=20 width=84)
                                ->  Limit  (cost=12055.58..12055.58 rows=20
width=84)
                                      ->  Sort  (cost=12055.58..12055.58
rows=23724 width=84)
                                            ->  Hash Join  (cost=1.20..9674.89
rows=23724 width=84)
                                                  ->  Seq Scan on a_trans_log a
 (cost=0.00..8695.30 rows=24455 width=60)
                                                  ->  Hash  (cost=1.16..1.16
rows=16 width=24)
                                                        ->  Seq Scan on
addtypelong tl  (cost=0.00..1.16 rows=16 width=24)
                          ->  Subquery Scan *SELECT* 2  (cost=617.19..617.19
rows=20 width=84)
                                ->  Limit  (cost=617.19..617.19 rows=20
width=84)
                                      ->  Sort  (cost=617.19..617.19 rows=2462
width=84)
                                            ->  Hash Join  (cost=1.20..478.50
rows=2462 width=84)
                                                  ->  Seq Scan on b_trans_log b
 (cost=0.00..378.61 rows=2462 width=60)
                                                  ->  Hash  (cost=1.16..1.16
rows=16 width=24)
                                                        ->  Seq Scan on
addtypelong tl  (cost=0.00..1.16 rows=16 width=24)

That cut the query down to 6 seconds. About 5 seconds longer than I would like
it. I'll take any performance increase, tho. I tried SET enable_seqscan=off;
And it takes 8 seconds. Strange indeed.

The reason I don't put the two tables together is that the tables don't have
the exact same structure... I suppose I could merge the two, but it would be
counterintuitive to logically apply some fields in table a to data collected
for table b. Plus, I do so many more operations on the single tables than the
joined tables that it wouldn't make good use of programming time to rewrite all
the single table ops.

I know... excuses, excuses! Thanks for the input. Any more ideas?

CG

--- Jean-Luc Lachance <jllachan@nsd.ca> wrote:
> Chris,
>
> I believe you can cut down the processing time by including the order by
> and limit in each select:
>
> (SELECT ... ORDER BY 2 DESC, 4 LIMIT 20)
> union
> (SELECT ... ORDER BY 2 DESC, 4 LIMIT 20)
> ORDER BY 2 DESC, 4 LIMIT 20
>
> Give it a try and let me know
>
>
> JLL
>
> P.S.
>
> Question:  Why not have a single table with a field log_type in ( 'A',
> 'B') ???
>
>
>
> Chris Gamache wrote:
> >
> > Three tables...
> > [...]
> >
> > I imagine the combination of UNION and ORDER BY causes the problem, since
> > Postgres has to locate all the rows that match the search criteria, merge
> them,
> > order them, then return the top 20...
> >
> > Any suggestions? Did I forget to provide any data that would make things
> > clearer?
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)



__________________________________________________
Do You Yahoo!?
Yahoo! Sports - sign up for Fantasy Baseball
http://sports.yahoo.com

Re: improving performance of UNION and ORDER BY

From
Darren Ferguson
Date:
This is not always faster but sometimes when using Postgres i have found
it to be significantly see below marked ----->>

I have removed the join from the where clause and put it in the from
clause .

Probably wont make a difference but i have had 7 occassions when it have
been significantly faster and i mean cutting 9 seconds to 1 second etc

Hope this helps

Darren Ferguson

On Mon, 4 Mar 2002, Chris Gamache wrote:

> New Query...
>
> (select
>    a.username as "User",
>    a.trans_date as "Date",
>    tl.longtype as "Type",
>    a.trans_data as "Query Data",
>    a.trans_charge as "Charged",
>    a.user_reference_id as "Reference ID"
>  from a_trans_log a
--->>   join addtypelong tl on a.trans_type = t1.shorttype
>  where (
>    (a.trans_date >= '12/31/01'::TIMESTAMP)
> ) order by 4 desc, 2 limit 20)
> union
> (select
>    b.username as "User",
>    b.trans_date as "Date",
>    tl.longtype as "Type",
>    b.trans_data as "Query Data",
>    b.trans_charge as "Charged",
>    b.user_reference_id as "Reference ID"
>  from b_trans_log b
--->>   join addtypelong tl on b.trans_type = t1.shorttype
>  where (
>    (b.trans_date >= '12/31/01'::TIMESTAMP)
> ) order by 4 desc, 2 limit 20)
> order by 4 desc, 2 limit 20;
>
> Limit  (cost=12674.47..12674.47 rows=4 width=84)
>   ->  Sort  (cost=12674.47..12674.47 rows=4 width=84)
>         ->  Unique  (cost=12673.83..12674.43 rows=4 width=84)
>               ->  Sort  (cost=12673.83..12673.83 rows=40 width=84)
>                     ->  Append  (cost=12055.58..12672.77 rows=40 width=84)
>                           ->  Subquery Scan *SELECT* 1
> (cost=12055.58..12055.58 rows=20 width=84)
>                                 ->  Limit  (cost=12055.58..12055.58 rows=20
> width=84)
>                                       ->  Sort  (cost=12055.58..12055.58
> rows=23724 width=84)
>                                             ->  Hash Join  (cost=1.20..9674.89
> rows=23724 width=84)
>                                                   ->  Seq Scan on a_trans_log a
>  (cost=0.00..8695.30 rows=24455 width=60)
>                                                   ->  Hash  (cost=1.16..1.16
> rows=16 width=24)
>                                                         ->  Seq Scan on
> addtypelong tl  (cost=0.00..1.16 rows=16 width=24)
>                           ->  Subquery Scan *SELECT* 2  (cost=617.19..617.19
> rows=20 width=84)
>                                 ->  Limit  (cost=617.19..617.19 rows=20
> width=84)
>                                       ->  Sort  (cost=617.19..617.19 rows=2462
> width=84)
>                                             ->  Hash Join  (cost=1.20..478.50
> rows=2462 width=84)
>                                                   ->  Seq Scan on b_trans_log b
>  (cost=0.00..378.61 rows=2462 width=60)
>                                                   ->  Hash  (cost=1.16..1.16
> rows=16 width=24)
>                                                         ->  Seq Scan on
> addtypelong tl  (cost=0.00..1.16 rows=16 width=24)
>
> That cut the query down to 6 seconds. About 5 seconds longer than I would like
> it. I'll take any performance increase, tho. I tried SET enable_seqscan=off;
> And it takes 8 seconds. Strange indeed.
>
> The reason I don't put the two tables together is that the tables don't have
> the exact same structure... I suppose I could merge the two, but it would be
> counterintuitive to logically apply some fields in table a to data collected
> for table b. Plus, I do so many more operations on the single tables than the
> joined tables that it wouldn't make good use of programming time to rewrite all
> the single table ops.
>
> I know... excuses, excuses! Thanks for the input. Any more ideas?
>
> CG
>
> --- Jean-Luc Lachance <jllachan@nsd.ca> wrote:
> > Chris,
> >
> > I believe you can cut down the processing time by including the order by
> > and limit in each select:
> >
> > (SELECT ... ORDER BY 2 DESC, 4 LIMIT 20)
> > union
> > (SELECT ... ORDER BY 2 DESC, 4 LIMIT 20)
> > ORDER BY 2 DESC, 4 LIMIT 20
> >
> > Give it a try and let me know
> >
> >
> > JLL
> >
> > P.S.
> >
> > Question:  Why not have a single table with a field log_type in ( 'A',
> > 'B') ???
> >
> >
> >
> > Chris Gamache wrote:
> > >
> > > Three tables...
> > > [...]
> > >
> > > I imagine the combination of UNION and ORDER BY causes the problem, since
> > > Postgres has to locate all the rows that match the search criteria, merge
> > them,
> > > order them, then return the top 20...
> > >
> > > Any suggestions? Did I forget to provide any data that would make things
> > > clearer?
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 2: you can get off all lists at once with the unregister command
> >     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
>
>
>
> __________________________________________________
> Do You Yahoo!?
> Yahoo! Sports - sign up for Fantasy Baseball
> http://sports.yahoo.com
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
>


Re: improving performance of UNION and ORDER BY

From
Stephan Szabo
Date:
On Mon, 4 Mar 2002, Chris Gamache wrote:

> That cut the query down to 6 seconds. About 5 seconds longer than I would like
> it. I'll take any performance increase, tho. I tried SET enable_seqscan=off;
> And it takes 8 seconds. Strange indeed.
>
> The reason I don't put the two tables together is that the tables don't have
> the exact same structure... I suppose I could merge the two, but it would be
> counterintuitive to logically apply some fields in table a to data collected
> for table b. Plus, I do so many more operations on the single tables than the
> joined tables that it wouldn't make good use of programming time to rewrite all
> the single table ops.
>
> I know... excuses, excuses! Thanks for the input. Any more ideas?

Are there going to possibly be equal rows in the two parts that you need
to merge into one row? If not, try union all which should get rid of a
sort and unique I think.


Re: improving performance of UNION and ORDER BY

From
Chris Gamache
Date:
I think I've found an acceptable compromise. The optimizer will use the index
if I limit each of the subqueries to no more than 6200 rows. I think a
composite total ~12000 is enough data, understanding that the users of this
query would be silly to use it for paging through more than a week's data. I
think it'd be silly to go beyond 24 hours... Anyway, if they're looking for
something more specific, I take out the limits to allow it search through the
entire database, and unless they're not using specific enough search criteria
the query will still move rather quickly. A decent compromise.

Thanks for helping me nail this one down!

CG

--- Stephan Szabo <sszabo@megazone23.bigpanda.com> wrote:
> On Wed, 6 Mar 2002, Chris Gamache wrote:
>
> > UNION ALL was an excellent idea! It didn't cut much time off, but at least
> no
> > resources are devoted to eliminating the nonexistant duplicate rows.
> >
> > I've had days to think about this. It seems as though the ORDER BY part of
> the
> > first query is the culprit. When I run this query by itself, I can see that
> it
> > would comprise the bulk of the UNION query time.
> >
> > select
> >    a.username as "User",
> >    a.trans_date as "Date",
> >    tl.longtype as "Type",
> >    a.trans_data as "Query Data",
> >    a.trans_charge as "Charged",
> >    a.user_reference_id as "Reference ID"
> >  from a_trans_log a
> >  join addtypelong tl on a.trans_type = tl.shorttype
> >  where a.trans_date >= '12/31/01'::TIMESTAMP
> >  order by a.trans_date desc, a.trans_data limit 20;
> >
> > By removing the ORDER BY a.trans_data, it cut the query down to the "almost
> > instant" level... EXPLAIN shows me that it uses the indeces! I guess I need
> to
> > drop that part of the ORDER BY, or make an index for it to use... Bah.
>
> Yeah, although since you're doing different kinds of scans on the two
> columns (desc and asc), I'm not sure what'd be necessary to get the index
> used.  Normally you'd just make an index on trans_date,trans_data, but
> that won't work here.  I think Tom Lane may have discussed a way to get
> the index scan for that case sometime in the past couple of months on the
> mailing list.
>
> > Alas...
> > Unless someone knows different, I don't believe that I can use the LIMIT
> > statement in each of the subqueries. The app needs to page through the
> > UNIONized table. As the app pages through LIMIT 20,20 LIMIT 20,40 LIMIT
> 20,60,
> > etc. It needs to look at the whole sorted UNIONized table. (am I making ANY
> > sense?)
>
> That makes sense.  You could cheat a little if you knew how large the
> results sets for the parts were if you're using union all since I believe
> it will do them in order and just append the sets, but that doesn't seem
> like a good idea in general.
>
> > Even by dropping the order on column 4, it still takes 6 seconds to
> assemble
> > the data, sort it and limit it... Could I create a cross-table index
> > specifically for this query? I doubt its a) possible, b) (even if possible)
> a
> > good idea.
>
> I don't think so.
>
> One question is whether some of this could be done by keeping another
> table around that you do your limits through.  This would at least cut
> down some of the cost (or for that matter if you could make a table with
> this using triggers and such).
>
>


__________________________________________________
Do You Yahoo!?
Try FREE Yahoo! Mail - the world's greatest free email!
http://mail.yahoo.com/