Thread: How to optimize a query...

How to optimize a query...

From: secret
Date:
    I originally had this query with an OR clause instead of the UNION,
which made it amazingly, impossibly slow. With the UNION it's far faster;
however, it's still very slow (10,000+ rows in both tables). Is there a
way to speed this up?  I have BTREE indexes on all the concerned fields of
the tables, but it doesn't seem to use any of them...

    I have another query that joins the result of this with 5 other
tables; unfortunately that one takes like 10 minutes...  Any suggestions
on optimizations would be much appreciated.

ftc=> explain
ftc-> select po_id,ticket_pk
ftc-> from tickets,po
ftc-> where po_id=material_po
ftc-> union
ftc-> select po_id,ticket_pk
ftc-> from tickets,po
ftc-> where po_id=trucking_po
ftc-> ;
NOTICE:  QUERY PLAN:

Unique  (cost=4744.05 size=0 width=0)
  ->  Sort  (cost=4744.05 size=0 width=0)
        ->  Append  (cost=4744.05 size=0 width=0)
                ->  Hash Join  (cost=2372.03 size=11659303 width=12)
                      ->  Seq Scan on tickets  (cost=849.03 size=19213 width=8)
                      ->  Hash  (cost=0.00 size=0 width=0)
                            ->  Seq Scan on po  (cost=528.98 size=10848 width=4)
                ->  Hash Join  (cost=2372.03 size=13838477 width=12)
                      ->  Seq Scan on tickets  (cost=849.03 size=19213 width=8)
                      ->  Hash  (cost=0.00 size=0 width=0)
                            ->  Seq Scan on po  (cost=528.98 size=10848 width=4)

EXPLAIN


Re: [SQL] How to optimize a query...

From: "Ross J. Reedstrom"
Date:
secret wrote:
<description of slow query, with proper EXPLAIN output... thanks!>

Hmm, it looks like your tables are big enough to benefit from indices,
but the query plan didn't choose any. I'd guess an index on po.po_id,
and perhaps tickets.material_po and tickets.trucking_po might help.
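
For what it's worth, the definitions might look something like this (the
index names are made up; the table and column names are taken from your
query):

CREATE INDEX po_po_id_idx ON po (po_id);
CREATE INDEX tickets_material_po_idx ON tickets (material_po);
CREATE INDEX tickets_trucking_po_idx ON tickets (trucking_po);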

From what I understand, indices are of most use on attributes (fields)
that are mostly unique, like your po_id (probably) is. They can hurt you
if you build them on fields that are mostly _not_ unique (the classic
example being a gender field - only two likely values, lots of rows
returned for either, so you end up scanning anyway...)
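
(A quick way to eyeball that, using a hypothetical table t with a field f
you're thinking of indexing:

SELECT count(DISTINCT f) AS distinct_vals, count(*) AS total_rows FROM t;

The closer distinct_vals is to total_rows, the more likely an index on f
is to pay off.)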

Any real experts out there to correct me? ;-)

Ross
--
Ross J. Reedstrom, Ph.D., <reedstrm@rice.edu>
NSBRI Research Scientist/Programmer
Computer and Information Technology Institute
Rice University, 6100 S. Main St.,  Houston, TX 77005

Re: How to optimize a query...

From: Tom Lane
Date:
>     I originally had this query with an OR clause instead of the UNION,
> which made it amazingly, impossibly slow. With the UNION it's far faster;
> however, it's still very slow (10,000+ rows in both tables). Is there a
> way to speed this up?

Hmm.  What does EXPLAIN give as the query plan if you use the OR
approach?  The UNION method requires much more work than a decent OR
plan would need; as you can see from the query plan, it has to sort and
unique-ify the results of the two sub-selects in order to discard
duplicate tuples.  If you don't mind seeing the same tuple twice when
it matches on both PO fields, you could use UNION ALL rather than UNION
to avoid the sort step.  But that's just a hack...
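
Concretely, that's just your query with the set operation swapped:

select po_id,ticket_pk from tickets,po where po_id=material_po
union all
select po_id,ticket_pk from tickets,po where po_id=trucking_po;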

I think the problem might be that Postgres doesn't know how to handle
ORed join clauses very efficiently.  Something to work on in the future.
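
For reference, I'm assuming the OR form you started with was something like

select po_id,ticket_pk
from tickets,po
where po_id=material_po or po_id=trucking_po;

and its EXPLAIN output is what I'd want to see.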

>     I have another query that joins the result of this with 5 other
> tables, unfortunately that one takes like 10 minutes...

How long does it take to EXPLAIN that query?  Postgres' optimizer has
some bugs that cause it to take an unreasonable amount of time to plan
a query that joins more than a few tables.  You can tell whether it's
the planning or execution time that's the problem by comparing EXPLAIN
runtime to the actual query.  (The bugs are fixed in 6.5, btw.  In the
meantime a workaround is to reduce the GEQO threshold to less than the
number of tables in the query that's giving you trouble.)
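
(On 6.4, from memory, the threshold is set with something along these
lines -- treat the exact syntax as an assumption, and pick a value below
the number of tables you're joining:

SET GEQO TO 'ON=5';  -- use the genetic optimizer for joins of 5+ tables

On later releases the same knob is the geqo_threshold run-time variable.)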

            regards, tom lane

Re: [SQL] How to optimize a query...

From: secret
Date:
secret wrote:

>     I originally had this query with an OR clause instead of the UNION,
> which made it amazingly, impossibly slow. With the UNION it's far faster;
> however, it's still very slow (10,000+ rows in both tables). Is there a
> way to speed this up?  I have BTREE indexes on all the concerned fields of
> the tables, but it doesn't seem to use any of them...
>
>     I have another query that joins the result of this with 5 other
> tables; unfortunately that one takes like 10 minutes...  Any suggestions
> on optimizations would be much appreciated.
>
> ftc=> explain
> ftc-> select po_id,ticket_pk
> ftc-> from tickets,po
> ftc-> where po_id=material_po
> ftc-> union
> ftc-> select po_id,ticket_pk
> ftc-> from tickets,po
> ftc-> where po_id=trucking_po
> ftc-> ;
> NOTICE:  QUERY PLAN:
>
> Unique  (cost=4744.05 size=0 width=0)
>   ->  Sort  (cost=4744.05 size=0 width=0)
>         ->  Append  (cost=4744.05 size=0 width=0)
>                 ->  Hash Join  (cost=2372.03 size=11659303 width=12)
>                       ->  Seq Scan on tickets  (cost=849.03 size=19213 width=8)
>                       ->  Hash  (cost=0.00 size=0 width=0)
>                             ->  Seq Scan on po  (cost=528.98 size=10848 width=4)
>                 ->  Hash Join  (cost=2372.03 size=13838477 width=12)
>                       ->  Seq Scan on tickets  (cost=849.03 size=19213 width=8)
>                       ->  Hash  (cost=0.00 size=0 width=0)
>                             ->  Seq Scan on po  (cost=528.98 size=10848 width=4)
>
> EXPLAIN

    I'm having a lot of problems with performance under PostgreSQL; it seems
most of my major queries aren't using indexes (as above)... tickets has 15k
rows, po 10k... The above query takes about 4 minutes...

    Unfortunately I need to add in joins to about 5 tables... If I add one
more table it's 10 minutes... two more?  Who knows; I know by 5 it's
unacceptable.  Can anyone please give me some advice on what my problems
might be with these queries?  I've done VACUUM ANALYZE... I didn't used to
do that; could it be missing data from way back?  Should I reload the
entire database?
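
(That is, I've been running something like

VACUUM ANALYZE tickets;
VACUUM ANALYZE po;

against all the tables involved.)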

--David


Tom Lane wrote:

> >     I originally had this query with an OR clause instead of the UNION,
> > which made it amazingly, impossibly slow. With the UNION it's far faster;
> > however, it's still very slow (10,000+ rows in both tables). Is there a
> > way to speed this up?
>
> Hmm.  What does EXPLAIN give as the query plan if you use the OR
> approach?  The UNION method requires much more work than a decent OR
> plan would need; as you can see from the query plan, it has to sort and
> unique-ify the results of the two sub-selects in order to discard
> duplicate tuples.  If you don't mind seeing the same tuple twice when
> it matches on both PO fields, you could use UNION ALL rather than UNION
> to avoid the sort step.  But that's just a hack...
>

    EXPLAIN seems to be telling a different story now...  It used to give a
bunch of sequential scans; when I used UNION it gave fewer hash joins, and I
thought fewer hash joins might be better (and they were: one would take hours
and run out of memory, the other would take an hour and not run out of
memory)... I have a stripped-down version of the query I need. The UNION
takes:
! system usage stats:
!       938.603740 elapsed 914.250000 user 1.140000 system sec
!       [914.260000 user 1.140000 sys total]
!       0/0 [0/0] filesystem blocks in/out
!       3799/13766 [3915/13873] page faults/reclaims, 654 [654] swaps
!       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
!       0/0 [0/0] voluntary/involuntary context switches
! postgres usage stats:
!       Shared blocks:          0 read,          0 written, buffer hit rate = 100.00%
!       Local  blocks:          0 read,          0 written, buffer hit rate = 0.00%
!       Direct blocks:        688 read,        716 written

When I use OR:
! system usage stats:
!       706.158704 elapsed 698.970000 user 0.370000 system sec
!       [698.980000 user 0.370000 sys total]
!       0/0 [0/0] filesystem blocks in/out
!       1025/6810 [1062/6906] page faults/reclaims, 0 [0] swaps
!       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
!       0/0 [0/0] voluntary/involuntary context switches
! postgres usage stats:
!       Shared blocks:        232 read,          0 written, buffer hit rate = 84.61%
!       Local  blocks:          0 read,          0 written, buffer hit rate = 0.00%
!       Direct blocks:          0 read,          0 written


>
> I think the problem might be that Postgres doesn't know how to handle
> ORed join clauses very efficiently.  Something to work on in the future.
>

    I think that may have been because I didn't know to VACUUM ANALYZE;
however, today I found a much more disturbing thing: if I simply change the
order of the SELECT clause I can turn a 9.69 second query into a 605 second
query.  Below is the example:

SELECT po_id,          <- PK of po
       material.name,  <- Notice the material name here
       ticket_pk       <- PK of tickets
FROM po,tickets,material
WHERE   po_id=material_po
        AND po.units_id=tickets.units_id
        AND po.material_id=material.material_id
;
! system usage stats:
!       315.427223 elapsed 315.010000 user 0.250000 system sec
!       [315.020000 user 0.250000 sys total]
!       0/0 [0/0] filesystem blocks in/out
!       645/6717 [682/6812] page faults/reclaims, 0 [0] swaps
!       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
!       0/0 [0/0] voluntary/involuntary context switches
! postgres usage stats:
!       Shared blocks:        144 read,          0 written, buffer hit rate = 89.66%
!       Local  blocks:          0 read,          0 written, buffer hit rate = 0.00%
!       Direct blocks:          0 read,          0 written


SELECT po_id,ticket_pk,material.name
FROM po,tickets,material
WHERE   po_id=material_po AND
        po.units_id=tickets.units_id AND
        po.material_id=material.material_id;

! system usage stats:
!       2.290408 elapsed 1.890000 user 0.250000 system sec
!       [1.900000 user 0.250000 sys total]
!       0/0 [0/0] filesystem blocks in/out
!       73/6691 [110/6786] page faults/reclaims, 0 [0] swaps
!       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
!       0/0 [0/0] voluntary/involuntary context switches
! postgres usage stats:
!       Shared blocks:          0 read,          0 written, buffer hit rate = 100.00%
!       Local  blocks:          0 read,          0 written, buffer hit rate = 0.00%
!       Direct blocks:          0 read,          0 written

    315 seconds vs 2.3 seconds?  Is there some light you can shed on
how queries are processed under PostgreSQL?  They both have the
exact same EXPLAIN:

Hash Join  (cost=5773.47 size=24494486 width=36)
  ->  Hash Join  (cost=899.69 size=101128 width=26)
        ->  Seq Scan on po  (cost=530.38 size=10860 width=10)
        ->  Hash  (cost=0.00 size=0 width=0)
              ->  Seq Scan on material  (cost=5.47 size=105 width=16)
  ->  Hash  (cost=0.00 size=0 width=0)
        ->  Seq Scan on tickets  (cost=878.78 size=19872 width=10)

    The only GUESS I can come up with is that one plan joins po & material
and then tickets, vs. po & tickets and then material; the latter would be
more efficient, since joining po & tickets is going to knock out lots of
rows from po, so there would be less joining against material, which has an
entry for every po.  I have indexes on all the columns involved.
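
(If the join order really is the culprit, one experiment -- though only on
a release that supports the explicit SQL-92 JOIN syntax, which I don't
believe 6.4 does, and which honors the join order as written -- would be:

SELECT po_id, ticket_pk, material.name
FROM (po JOIN tickets ON po_id=material_po
                     AND po.units_id=tickets.units_id)
     JOIN material ON po.material_id=material.material_id;

which would force po & tickets to be joined before material.)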

>
> >     I have another query that joins the result of this with 5 other
> > tables, unfortunately that one takes like 10 minutes...
>
> How long does it take to EXPLAIN that query?  Postgres' optimizer has
> some bugs that cause it to take an unreasonable amount of time to plan
> a query that joins more than a few tables.  You can tell whether it's
> the planning or execution time that's the problem by comparing EXPLAIN
> runtime to the actual query.  (The bugs are fixed in 6.5, btw.  In the
> meantime a workaround is to reduce the GEQO threshold to less than the
> number of tables in the query that's giving you trouble.)
>
>                         regards, tom lane

    The EXPLAIN takes a few seconds, maybe 10... It's definitely a runtime
thing; one time I ran the silly thing and it sucked up 192M of RAM (64M
physical + 128M swap) and then died... Sigh.  I'm slowly working my way back
up to it; I just bought 128M of extra RAM... If you could shed some light on
the extreme performance difference for a simple change of order in the
SELECT part of it, I'd appreciate it...  I don't think PostgreSQL is using
my indexes very well, or at all... :(

    Sorry about the length of the message; I wanted to include as much detail
as possible.

David Secret
MIS Director
Kearney Development Co., Inc.