Thread: Re: [HACKERS] please help on query

Re: [HACKERS] please help on query

From
Manfred Koizar
Date:
[moving to pgsql-sql]
On Thu, 11 Jul 2002 17:22:14 +0200, "Luis Alberto Amigo Navarro"
<lamigo@atc.unican.es> wrote:
>I can't improve performance on this query:
>
>SELECT
> supplier.name,
> supplier.address
>FROM
> supplier,
> nation
>WHERE
> supplier.suppkey IN(
>  SELECT
>   partsupp.suppkey
>  FROM
>   partsupp
>  WHERE
>   partsupp.partkey IN(
>    SELECT
>     part.partkey
>    FROM
>     part
>    WHERE
>     part.name like 'forest%'
>     )
>   AND partsupp.availqty>(
>    SELECT
>     0.5*(sum(lineitem.quantity)::FLOAT)
>    FROM
>     lineitem
>    WHERE
>     lineitem.partkey=partsupp.partkey
>     AND lineitem.suppkey=partsupp.partkey                                   ^^^^^^^
suppkey???
 
>     AND lineitem.shipdate>=('1994-01-01')::DATE
>     AND lineitem.shipdate<(('1994-01-01')::DATE+('1 year')::INTERVAL)::DATE
>     )
>  )
> AND supplier.nationkey=nation.nationkey
> AND nation.name='CANADA'
>ORDER BY
> supplier.name;

Luis,
rules of thumb: "Avoid subselects; use joins!" and "If you have to use
subselects, avoid IN, use EXISTS!"

Let's try.  If partkey is unique in part, then
|   FROM partsupp
|   WHERE partsupp.partkey IN (SELECT part.partkey

can be replaced by   FROM partsupp ps, part p   WHERE ps.partkey = p.partkey

or   partsupp ps INNER JOIN part p     ON (ps.partkey = p.partkey AND p.name LIKE '...')

When we ignore "part" for now, your subselect boils down to

|  SELECT partsupp.suppkey
|  FROM partsupp
|  WHERE partsupp.availqty > (
|    SELECT 0.5*(sum(lineitem.quantity)::FLOAT)
|    FROM lineitem
|    WHERE lineitem.partkey=partsupp.partkey
|      AND lineitem.suppkey=partsupp.suppkey
|      AND lineitem.shipdate BETWEEN ... AND ...
|     )

which can be rewritten to (untested)
 SELECT ps.suppkey FROM partsupp ps, lineitem li WHERE li.partkey=ps.partkey   AND li.suppkey=ps.suppkey   AND
lineitem.shipdateBETWEEN ... AND ... GROUP BY ps.partkey, ps.suppkey HAVING min(ps.availqty) >
0.5*(sum(lineitem.quantity)::FLOAT)       ^^^        As all ps.availqty are equal in one group, you can as well
 
use max() or avg().

Now we have left only one IN:
| WHERE supplier.suppkey IN (
|  SELECT partsupp.suppkey FROM partsupp WHERE <condition> )

Being to lazy to find out, if this can be rewritten to a join, let`s
apply rule 2 here:
 WHERE EXISTS (   SELECT ... FROM partsupp ps   WHERE supplier.suppkey = ps.suppkey     AND <condition> )

HTH, but use with a grain of salt ...

>Sort  (cost=2777810917708.17..2777810917708.17 rows=200 width=81)            ^^^^^^^^^^^^^^^^
BTW, how many years are these? :-)

ServusManfred


Re: [HACKERS] please help on query

From
Manfred Koizar
Date:
On Thu, 11 Jul 2002 19:40:46 +0200, "Luis Alberto Amigo Navarro"
<lamigo@atc.unican.es> wrote:
>I've tried
[reformatted to fit on one page]
| SELECT supplier.name, supplier.address
| FROM supplier, nation, lineitem
You already found out that you do not need lineitem here.

| WHERE EXISTS(
|   SELECT partsupp.suppkey
|   FROM partsupp,lineitem
|   WHERE
|    lineitem.partkey=partsupp.partkey
|    AND lineitem.suppkey=partsupp.partkey
I still don't believe this  suppkey=partkey

|    AND lineitem.shipdate [...]
|    AND EXISTS( SELECT part.partkey
|                FROM part WHERE part.name like 'forest%')
This subselect gives either true or false, but in any case always the
same result.  You might want to add a conditionAND part.partkey=partsupp.partkey

Are you sure partkey is not unique?  If it is unique you can replace
this subselect by a join.

|   GROUP BY partsupp.partkey,partsupp.suppkey
|   HAVING min(availqty)>(0.5*(sum(lineitem.quantity)::FLOAT))
|   )
|  AND supplier.nationkey=nation.nationkey
|  AND nation.name='CANADA'
| ORDER BY supplier.name;

>as you said and something is wrong
>Sort  (cost=1141741215.35..1141741215.35 rows=2400490000 width=81)

The cost is now only 1141741215.35 compared to 2777810917708.17
before;  this is an improvement factor of more than 2000.  So what's
your problem? ;-)

ServusManfred


Re: [HACKERS] please help on query

From
"Luis Alberto Amigo Navarro"
Date:


> The cost is now only 1141741215.35 compared to 2777810917708.17
> before;  this is an improvement factor of more than 2000.  So what's
> your problem? ;-)
>
> Servus
>  Manfred
>
In fact planner is estimating incredibly badly, it took only 833msecs now
runs perfectly

I'm going to keep on asking about another query:

SELECTcustomer.name,customer.custkey,orders.orderkey,orders.orderdate,orders.totalprice,sum(lineitem.quantity)
FROMcustomer,orders,lineitem
WHEREexists( SELECT  lineitem.orderkey FROM  lineitem WHERE  lineitem.orderkey=orders.orderkey GROUP BY
lineitem.orderkeyHAVING  sum(lineitem.quantity)>300 )AND customer.custkey=orders.custkeyAND
orders.orderkey=lineitem.orderkey
GROUP BYcustomer.name,customer.custkey,orders.orderkey,orders.orderdate,orders.totalprice

ORDER BYorders.totalprice DESC,orders.orderdate;

NOTICE:  QUERY PLAN:

Sort  (cost=26923941.97..26923941.97 rows=300061 width=66) ->  Aggregate  (cost=26851634.86..26896644.05 rows=300061
width=66)      ->  Group  (cost=26851634.86..26889142.52 rows=3000612 width=66)             ->  Sort
(cost=26851634.86..26851634.86rows=3000612
 
width=66)                   ->  Hash Join  (cost=26107574.81..26457309.10
rows=3000612 width=66)                         ->  Seq Scan on lineitem  (cost=0.00..222208.25
rows=6001225 width=8)                         ->  Hash  (cost=26105699.81..26105699.81
rows=750000 width=58)                               ->  Hash Join  (cost=7431.00..26105699.81
rows=750000 width=58)                                     ->  Seq Scan on orders
(cost=0.00..26083268.81 rows=750000 width=25)                                           SubPlan
                   ->  Aggregate
 
(cost=0.00..17.35 rows=1 width=8)                                                   ->  Group
(cost=0.00..17.34 rows=5 width=8)                                                         ->  Index Scan
using lineitem_pkey on lineitem  (cost=0.00..17.33 rows=5 width=8)                                     ->  Hash
(cost=7056.00..7056.00
rows=150000 width=33)                                           ->  Seq Scan on customer
(cost=0.00..7056.00 rows=150000 width=33)

again:
orders 1500000 tuples
lineitem 6000000 tuples there are 1 to 7 lineitems per orderkey
Customer 150000 tuples

select attname,n_distinct,correlation from pg_stats where
tablename='lineitem';   attname    | n_distinct | correlation
---------------+------------+-------------orderkey      |  -0.199847 |           1partkey       |     196448 |
0.0223377suppkey      |       9658 | -0.00822751linenumber    |          7 |     0.17274quantity      |         50 |
0.0150153extendedprice|      25651 | -0.00790245discount      |         11 |    0.103761tax           |          9 |
0.0993771returnflag   |          3 |    0.391434linestatus    |          2 |    0.509791shipdate      |       2440 |
0.0072777commitdate   |       2497 |  0.00698162receiptdate   |       2416 |  0.00726686shipinstruct  |          4 |
0.241511shipmode     |          7 |    0.138432comment       |     275488 |   0.0188006
 
(16 rows)

select attname,n_distinct,correlation from pg_stats where
tablename='orders';   attname    | n_distinct | correlation
---------------+------------+-------------orderkey      |         -1 |   -0.999925custkey       |      76309 |
0.00590596orderstatus  |          3 |    0.451991totalprice    |         -1 | -0.00768806orderdate     |       2431 |
-0.0211354orderpriority|          5 |    0.182489clerk         |       1009 |  0.00546939shippriority  |          1 |
       1comment       |  -0.750125 |  -0.0123887
 

Customer attname   | n_distinct | correlation
------------+------------+-------------custkey    |         -1 |           1name       |         -1 |
1address   |         -1 | -0.00510274nationkey  |         25 |   0.0170533phone      |         -1 |  -0.0227816acctbal
 |   -0.83444 | -0.00220958mktsegment |          5 |    0.205013comment    |         -1 |   0.0327827
 

This query takes 12 minutes to run and returns about 50 customers.
lineitem.quantity takes values from 1 to 50, so 300 per orderkey is very
restrictive

May someone help on improving performance?
Again thanks in advance
Regards




Re: [HACKERS] please help on query

From
Jakub Ouhrabka
Date:
hi,

avoid subselect: create a temp table and use join...

CREATE TEMP TABLE tmp AS  SELECT   lineitem.orderkey  FROM   lineitem  WHERE   lineitem.orderkey=orders.orderkey  GROUP
BY  lineitem.orderkey HAVING   sum(lineitem.quantity)>300;
 

CREATE INDEX tmp_idx ON tmp (orderkey);
SELECT customer.name, customer.custkey, orders.orderkey, orders.orderdate, orders.totalprice,
sum(lineitem.quantity)FROMcustomer, orders, lineitem, tmpWHERE orders.orderkey=tmp.orderkey AND
customer.custkey=orders.custkeyAND orders.orderkey=lineitem.orderkeyGROUP BY customer.name, customer.custkey,
orders.orderkey,orders.orderdate, orders.totalpriceORDER BY orders.totalprice DESC, orders.orderdate;
 


may be the index is not necessary...

kuba


> I'm going to keep on asking about another query:
>
> SELECT
>  customer.name,
>  customer.custkey,
>  orders.orderkey,
>  orders.orderdate,
>  orders.totalprice,
>  sum(lineitem.quantity)
> FROM
>  customer,
>  orders,
>  lineitem
> WHERE
>  exists(
>   SELECT
>    lineitem.orderkey
>   FROM
>    lineitem
>   WHERE
>    lineitem.orderkey=orders.orderkey
>   GROUP BY
>    lineitem.orderkey HAVING
>    sum(lineitem.quantity)>300
>   )
>  AND customer.custkey=orders.custkey
>  AND orders.orderkey=lineitem.orderkey
> GROUP BY
>  customer.name,
>  customer.custkey,
>  orders.orderkey,
>  orders.orderdate,
>  orders.totalprice
>
> ORDER BY
>  orders.totalprice DESC,
>  orders.orderdate;
>
> NOTICE:  QUERY PLAN:
>
> Sort  (cost=26923941.97..26923941.97 rows=300061 width=66)
>   ->  Aggregate  (cost=26851634.86..26896644.05 rows=300061 width=66)
>         ->  Group  (cost=26851634.86..26889142.52 rows=3000612 width=66)
>               ->  Sort  (cost=26851634.86..26851634.86 rows=3000612
> width=66)
>                     ->  Hash Join  (cost=26107574.81..26457309.10
> rows=3000612 width=66)
>                           ->  Seq Scan on lineitem  (cost=0.00..222208.25
> rows=6001225 width=8)
>                           ->  Hash  (cost=26105699.81..26105699.81
> rows=750000 width=58)
>                                 ->  Hash Join  (cost=7431.00..26105699.81
> rows=750000 width=58)
>                                       ->  Seq Scan on orders
> (cost=0.00..26083268.81 rows=750000 width=25)
>                                             SubPlan
>                                               ->  Aggregate
> (cost=0.00..17.35 rows=1 width=8)
>                                                     ->  Group
> (cost=0.00..17.34 rows=5 width=8)
>                                                           ->  Index Scan
> using lineitem_pkey on lineitem  (cost=0.00..17.33 rows=5 width=8)
>                                       ->  Hash  (cost=7056.00..7056.00
> rows=150000 width=33)
>                                             ->  Seq Scan on customer
> (cost=0.00..7056.00 rows=150000 width=33)
>
> again:
> orders 1500000 tuples
> lineitem 6000000 tuples there are 1 to 7 lineitems per orderkey
> Customer 150000 tuples
>
> select attname,n_distinct,correlation from pg_stats where
> tablename='lineitem';
>     attname    | n_distinct | correlation
> ---------------+------------+-------------
>  orderkey      |  -0.199847 |           1
>  partkey       |     196448 |   0.0223377
>  suppkey       |       9658 | -0.00822751
>  linenumber    |          7 |     0.17274
>  quantity      |         50 |   0.0150153
>  extendedprice |      25651 | -0.00790245
>  discount      |         11 |    0.103761
>  tax           |          9 |   0.0993771
>  returnflag    |          3 |    0.391434
>  linestatus    |          2 |    0.509791
>  shipdate      |       2440 |   0.0072777
>  commitdate    |       2497 |  0.00698162
>  receiptdate   |       2416 |  0.00726686
>  shipinstruct  |          4 |    0.241511
>  shipmode      |          7 |    0.138432
>  comment       |     275488 |   0.0188006
> (16 rows)
>
> select attname,n_distinct,correlation from pg_stats where
> tablename='orders';
>     attname    | n_distinct | correlation
> ---------------+------------+-------------
>  orderkey      |         -1 |   -0.999925
>  custkey       |      76309 |  0.00590596
>  orderstatus   |          3 |    0.451991
>  totalprice    |         -1 | -0.00768806
>  orderdate     |       2431 |  -0.0211354
>  orderpriority |          5 |    0.182489
>  clerk         |       1009 |  0.00546939
>  shippriority  |          1 |           1
>  comment       |  -0.750125 |  -0.0123887
>
> Customer
>   attname   | n_distinct | correlation
> ------------+------------+-------------
>  custkey    |         -1 |           1
>  name       |         -1 |           1
>  address    |         -1 | -0.00510274
>  nationkey  |         25 |   0.0170533
>  phone      |         -1 |  -0.0227816
>  acctbal    |   -0.83444 | -0.00220958
>  mktsegment |          5 |    0.205013
>  comment    |         -1 |   0.0327827
>
> This query takes 12 minutes to run and returns about 50 customers.
> lineitem.quantity takes values from 1 to 50, so 300 per orderkey is very
> restrictive
>
> May someone help on improving performance?
> Again thanks in advance
> Regards
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/users-lounge/docs/faq.html
>



Re: [HACKERS] please help on query

From
"Luis Alberto Amigo Navarro"
Date:
I've tried
SELECTsupplier.name,supplier.address
FROMsupplier,nation,lineitem
WHEREEXISTS( SELECT  partsupp.suppkey FROM  partsupp,lineitem WHERE  lineitem.partkey=partsupp.partkey  AND
lineitem.suppkey=partsupp.partkey AND lineitem.shipdate>=('1994-01-01')::DATE  AND
lineitem.shipdate<(('1994-01-01')::DATE+('1year')::INTERVAL)::DATE  AND EXISTS(   SELECT    part.partkey   FROM    part
 WHERE    part.name like 'forest%'    ) GROUP BY partsupp.partkey,partsupp.suppkey HAVING
min(availqty)>(0.5*(sum(lineitem.quantity)::FLOAT)))AND supplier.nationkey=nation.nationkeyAND nation.name='CANADA'
 
ORDER BYsupplier.name;

as you said and something is wrong
Sort  (cost=1141741215.35..1141741215.35 rows=2400490000 width=81) InitPlan   ->  Aggregate  (cost=0.00..921773.85
rows=48width=24)         InitPlan           ->  Seq Scan on part  (cost=0.00..8561.00 rows=1 width=4)         ->  Group
(cost=0.00..921771.44 rows=481 width=24)               ->  Result  (cost=0.00..921769.04 rows=481 width=24)
       ->  Merge Join  (cost=0.00..921769.04 rows=481
 
width=24)                           ->  Index Scan using partsupp_pkey on partsupp
(cost=0.00..98522.75 rows=800000 width=12)                           ->  Index Scan using lsupp_index on lineitem
(cost=0.00..821239.91 rows=145 width=12) ->  Result  (cost=1.31..112888690.31 rows=2400490000 width=81)       ->
NestedLoop  (cost=1.31..112888690.31 rows=2400490000 width=81)             ->  Hash Join  (cost=1.31..490.31 rows=400
width=81)                  ->  Seq Scan on supplier  (cost=0.00..434.00 rows=10000
 
width=77)                   ->  Hash  (cost=1.31..1.31 rows=1 width=4)                         ->  Seq Scan on nation
(cost=0.00..1.31rows=1
 
width=4)             ->  Seq Scan on lineitem  (cost=0.00..222208.25 rows=6001225
width=0)

where might be my mistake
Thanks and regards
----- Original Message -----
From: "Manfred Koizar" <mkoi-pg@aon.at>
To: "Luis Alberto Amigo Navarro" <lamigo@atc.unican.es>
Cc: <pgsql-sql@postgresql.org>
Sent: Thursday, July 11, 2002 6:47 PM
Subject: Re: [HACKERS] please help on query


> [moving to pgsql-sql]
> On Thu, 11 Jul 2002 17:22:14 +0200, "Luis Alberto Amigo Navarro"
> <lamigo@atc.unican.es> wrote:
> >I can't improve performance on this query:
> >
> >SELECT
> > supplier.name,
> > supplier.address
> >FROM
> > supplier,
> > nation
> >WHERE
> > supplier.suppkey IN(
> >  SELECT
> >   partsupp.suppkey
> >  FROM
> >   partsupp
> >  WHERE
> >   partsupp.partkey IN(
> >    SELECT
> >     part.partkey
> >    FROM
> >     part
> >    WHERE
> >     part.name like 'forest%'
> >     )
> >   AND partsupp.availqty>(
> >    SELECT
> >     0.5*(sum(lineitem.quantity)::FLOAT)
> >    FROM
> >     lineitem
> >    WHERE
> >     lineitem.partkey=partsupp.partkey
> >     AND lineitem.suppkey=partsupp.partkey
>                                     ^^^^^^^
>                                     suppkey ???
> >     AND lineitem.shipdate>=('1994-01-01')::DATE
> >     AND lineitem.shipdate<(('1994-01-01')::DATE+('1
year')::INTERVAL)::DATE
> >     )
> >  )
> > AND supplier.nationkey=nation.nationkey
> > AND nation.name='CANADA'
> >ORDER BY
> > supplier.name;
>





Re: [HACKERS] please help on query

From
"Luis Alberto Amigo Navarro"
Date:
Lineitem is being modified on run time, so creating a temp table don't
solves my problem
The time of creating this table is the same of performing the subselect (or
so I think), it could be done creating a new table, and a new trigger, but
there are already triggers to calculate
lineitem.extendedprice=part.retailprice*lineitem.quantity*(1+taxes)*(1-disco
unt) and to calculate orderstatus in order with linestatus and to calculate
orders.totalprice as sum(extendedprice) where
lineitem.orderkey=new.orderkey. A new trigger in order to insert orderkey if
sum(quantity) where orderkey=new.orderkey might be excessive.
Any other idea?
Thanks And Regards

----- Original Message -----
From: "Jakub Ouhrabka" <jakub.ouhrabka@comgate.cz>
To: "Luis Alberto Amigo Navarro" <lamigo@atc.unican.es>
Cc: "Manfred Koizar" <mkoi-pg@aon.at>; <pgsql-sql@postgresql.org>
Sent: Friday, July 12, 2002 1:50 PM
Subject: Re: [SQL] [HACKERS] please help on query


> hi,
>
> avoid subselect: create a temp table and use join...
>
> CREATE TEMP TABLE tmp AS
>    SELECT
>     lineitem.orderkey
>    FROM
>     lineitem
>    WHERE
>     lineitem.orderkey=orders.orderkey
>    GROUP BY
>     lineitem.orderkey HAVING
>     sum(lineitem.quantity)>300;
>
> CREATE INDEX tmp_idx ON tmp (orderkey);
>
>  SELECT
>   customer.name,
>   customer.custkey,
>   orders.orderkey,
>   orders.orderdate,
>   orders.totalprice,
>   sum(lineitem.quantity)
>  FROM
>   customer,
>   orders,
>   lineitem,
>   tmp
>  WHERE
>   orders.orderkey=tmp.orderkey
>   AND customer.custkey=orders.custkey
>   AND orders.orderkey=lineitem.orderkey
>  GROUP BY
>   customer.name,
>   customer.custkey,
>   orders.orderkey,
>   orders.orderdate,
>   orders.totalprice
>  ORDER BY
>   orders.totalprice DESC,
>   orders.orderdate;
>
>
> may be the index is not necessary...
>
> kuba
>
>
> > I'm going to keep on asking about another query:
> >
> > SELECT
> >  customer.name,
> >  customer.custkey,
> >  orders.orderkey,
> >  orders.orderdate,
> >  orders.totalprice,
> >  sum(lineitem.quantity)
> > FROM
> >  customer,
> >  orders,
> >  lineitem
> > WHERE
> >  exists(
> >   SELECT
> >    lineitem.orderkey
> >   FROM
> >    lineitem
> >   WHERE
> >    lineitem.orderkey=orders.orderkey
> >   GROUP BY
> >    lineitem.orderkey HAVING
> >    sum(lineitem.quantity)>300
> >   )
> >  AND customer.custkey=orders.custkey
> >  AND orders.orderkey=lineitem.orderkey
> > GROUP BY
> >  customer.name,
> >  customer.custkey,
> >  orders.orderkey,
> >  orders.orderdate,
> >  orders.totalprice
> >
> > ORDER BY
> >  orders.totalprice DESC,
> >  orders.orderdate;
> >
> > NOTICE:  QUERY PLAN:
> >
> > Sort  (cost=26923941.97..26923941.97 rows=300061 width=66)
> >   ->  Aggregate  (cost=26851634.86..26896644.05 rows=300061 width=66)
> >         ->  Group  (cost=26851634.86..26889142.52 rows=3000612 width=66)
> >               ->  Sort  (cost=26851634.86..26851634.86 rows=3000612
> > width=66)
> >                     ->  Hash Join  (cost=26107574.81..26457309.10
> > rows=3000612 width=66)
> >                           ->  Seq Scan on lineitem
(cost=0.00..222208.25
> > rows=6001225 width=8)
> >                           ->  Hash  (cost=26105699.81..26105699.81
> > rows=750000 width=58)
> >                                 ->  Hash Join
(cost=7431.00..26105699.81
> > rows=750000 width=58)
> >                                       ->  Seq Scan on orders
> > (cost=0.00..26083268.81 rows=750000 width=25)
> >                                             SubPlan
> >                                               ->  Aggregate
> > (cost=0.00..17.35 rows=1 width=8)
> >                                                     ->  Group
> > (cost=0.00..17.34 rows=5 width=8)
> >                                                           ->  Index Scan
> > using lineitem_pkey on lineitem  (cost=0.00..17.33 rows=5 width=8)
> >                                       ->  Hash  (cost=7056.00..7056.00
> > rows=150000 width=33)
> >                                             ->  Seq Scan on customer
> > (cost=0.00..7056.00 rows=150000 width=33)
> >
> > again:
> > orders 1500000 tuples
> > lineitem 6000000 tuples there are 1 to 7 lineitems per orderkey
> > Customer 150000 tuples
> >
> > select attname,n_distinct,correlation from pg_stats where
> > tablename='lineitem';
> >     attname    | n_distinct | correlation
> > ---------------+------------+-------------
> >  orderkey      |  -0.199847 |           1
> >  partkey       |     196448 |   0.0223377
> >  suppkey       |       9658 | -0.00822751
> >  linenumber    |          7 |     0.17274
> >  quantity      |         50 |   0.0150153
> >  extendedprice |      25651 | -0.00790245
> >  discount      |         11 |    0.103761
> >  tax           |          9 |   0.0993771
> >  returnflag    |          3 |    0.391434
> >  linestatus    |          2 |    0.509791
> >  shipdate      |       2440 |   0.0072777
> >  commitdate    |       2497 |  0.00698162
> >  receiptdate   |       2416 |  0.00726686
> >  shipinstruct  |          4 |    0.241511
> >  shipmode      |          7 |    0.138432
> >  comment       |     275488 |   0.0188006
> > (16 rows)
> >
> > select attname,n_distinct,correlation from pg_stats where
> > tablename='orders';
> >     attname    | n_distinct | correlation
> > ---------------+------------+-------------
> >  orderkey      |         -1 |   -0.999925
> >  custkey       |      76309 |  0.00590596
> >  orderstatus   |          3 |    0.451991
> >  totalprice    |         -1 | -0.00768806
> >  orderdate     |       2431 |  -0.0211354
> >  orderpriority |          5 |    0.182489
> >  clerk         |       1009 |  0.00546939
> >  shippriority  |          1 |           1
> >  comment       |  -0.750125 |  -0.0123887
> >
> > Customer
> >   attname   | n_distinct | correlation
> > ------------+------------+-------------
> >  custkey    |         -1 |           1
> >  name       |         -1 |           1
> >  address    |         -1 | -0.00510274
> >  nationkey  |         25 |   0.0170533
> >  phone      |         -1 |  -0.0227816
> >  acctbal    |   -0.83444 | -0.00220958
> >  mktsegment |          5 |    0.205013
> >  comment    |         -1 |   0.0327827
> >
> > This query takes 12 minutes to run and returns about 50 customers.
> > lineitem.quantity takes values from 1 to 50, so 300 per orderkey is very
> > restrictive
> >
> > May someone help on improving performance?
> > Again thanks in advance
> > Regards
> >
> >
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 5: Have you checked our extensive FAQ?
> >
> > http://www.postgresql.org/users-lounge/docs/faq.html
> >
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
>
>




Re: [HACKERS] please help on query

From
Masaru Sugawara
Date:
On Fri, 12 Jul 2002 17:32:50 +0200
"Luis Alberto Amigo Navarro" <lamigo@atc.unican.es> wrote:


> Lineitem is being modified on run time, so creating a temp table don't
> solves my problem
> The time of creating this table is the same of performing the subselect (or
> so I think), it could be done creating a new table, and a new trigger, but
> there are already triggers to calculate
> lineitem.extendedprice=part.retailprice*lineitem.quantity*(1+taxes)*(1-disco
> unt) and to calculate orderstatus in order with linestatus and to calculate
> orders.totalprice as sum(extendedprice) where
> lineitem.orderkey=new.orderkey. A new trigger in order to insert orderkey if
> sum(quantity) where orderkey=new.orderkey might be excessive.
> Any other idea?
> Thanks And Regards
> 
> ----- Original Message -----
> From: "Jakub Ouhrabka" <jakub.ouhrabka@comgate.cz>
> To: "Luis Alberto Amigo Navarro" <lamigo@atc.unican.es>
> Cc: "Manfred Koizar" <mkoi-pg@aon.at>; <pgsql-sql@postgresql.org>
> Sent: Friday, July 12, 2002 1:50 PM
> Subject: Re: [SQL] [HACKERS] please help on query
> 
> >
> > avoid subselect: create a temp table and use join...
> >
> > CREATE TEMP TABLE tmp AS
> >    SELECT
> >     lineitem.orderkey
> >    FROM
> >     lineitem
> >    WHERE
> >     lineitem.orderkey=orders.orderkey
> >    GROUP BY
> >     lineitem.orderkey HAVING
> >     sum(lineitem.quantity)>300;


Hi,

I'm not sure whether its performance can be improved or not.  But I feel
there is a slight chance to reduce the total number of the tuples which 
Planner must think.

BTW, how much time does the following query take in your situation, 
and how many rows does it retrieve ?


EXPLAIN ANALYZE
SELECT       lineitem.orderkey   FROM       lineitem   GROUP BY       lineitem.orderkey   HAVING
SUM(lineitem.quantity)> 300;
 



Regards,
Masaru Sugawara




Re: [HACKERS] please help on query

From
"Luis Alberto Amigo Navarro"
Date:
----- Original Message -----
From: "Masaru Sugawara" <rk73@sea.plala.or.jp>
To: "Luis Alberto Amigo Navarro" <lamigo@atc.unican.es>
Cc: <pgsql-sql@postgresql.org>
Sent: Sunday, July 14, 2002 2:23 PM
Subject: Re: [SQL] [HACKERS] please help on query


This is the output:

Aggregate  (cost=0.00..647161.10 rows=600122 width=8) (actual
time=4959.19..347328.83 rows=62 loops=1) ->  Group  (cost=0.00..632158.04 rows=6001225 width=8) (actual
time=10.79..274259.16 rows=6001225 loops=1)       ->  Index Scan using lineitem_pkey on lineitem
(cost=0.00..617154.97 rows=6001225 width=8) (actual time=10.77..162439.11
rows=6001225 loops=1)
Total runtime: 347330.28 msec

it is returning all rows in lineitem. Why is it using index?
Thanks and regards


> On Fri, 12 Jul 2002 17:32:50 +0200
> "Luis Alberto Amigo Navarro" <lamigo@atc.unican.es> wrote:
>
>
> > Lineitem is being modified on run time, so creating a temp table don't
> > solves my problem
> > The time of creating this table is the same of performing the subselect
(or
> > so I think), it could be done creating a new table, and a new trigger,
but
> > there are already triggers to calculate
> >
lineitem.extendedprice=part.retailprice*lineitem.quantity*(1+taxes)*(1-disco
> > unt) and to calculate orderstatus in order with linestatus and to
calculate
> > orders.totalprice as sum(extendedprice) where
> > lineitem.orderkey=new.orderkey. A new trigger in order to insert
orderkey if
> > sum(quantity) where orderkey=new.orderkey might be excessive.
> > Any other idea?
> > Thanks And Regards
> >
> > ----- Original Message -----
> > From: "Jakub Ouhrabka" <jakub.ouhrabka@comgate.cz>
> > To: "Luis Alberto Amigo Navarro" <lamigo@atc.unican.es>
> > Cc: "Manfred Koizar" <mkoi-pg@aon.at>; <pgsql-sql@postgresql.org>
> > Sent: Friday, July 12, 2002 1:50 PM
> > Subject: Re: [SQL] [HACKERS] please help on query
> >
> > >
> > > avoid subselect: create a temp table and use join...
> > >
> > > CREATE TEMP TABLE tmp AS
> > >    SELECT
> > >     lineitem.orderkey
> > >    FROM
> > >     lineitem
> > >    WHERE
> > >     lineitem.orderkey=orders.orderkey
> > >    GROUP BY
> > >     lineitem.orderkey HAVING
> > >     sum(lineitem.quantity)>300;
>
>
> Hi,
>
> I'm not sure whether its performance can be improved or not.  But I feel
> there is a slight chance to reduce the total number of the tuples which
> Planner must think.
>
> BTW, how much time does the following query take in your situation,
> and how many rows does it retrieve ?
>
>
> EXPLAIN ANALYZE
> SELECT
>         lineitem.orderkey
>     FROM
>         lineitem
>     GROUP BY
>         lineitem.orderkey
>     HAVING
>         SUM(lineitem.quantity) > 300;
>
>
>
> Regards,
> Masaru Sugawara
>
>
>




Re: [HACKERS] please help on query

From
Masaru Sugawara
Date:
On Mon, 15 Jul 2002 09:45:36 +0200
"Luis Alberto Amigo Navarro" <lamigo@atc.unican.es> wrote:

> This is the output:
> 
> Aggregate  (cost=0.00..647161.10 rows=600122 width=8) (actual
> time=4959.19..347328.83 rows=62 loops=1)
>   ->  Group  (cost=0.00..632158.04 rows=6001225 width=8) (actual
> time=10.79..274259.16 rows=6001225 loops=1)
>         ->  Index Scan using lineitem_pkey on lineitem
> (cost=0.00..617154.97 rows=6001225 width=8) (actual time=10.77..162439.11
> rows=6001225 loops=1)
> Total runtime: 347330.28 msec
> 
> it is returning all rows in lineitem. Why is it using index?


Sorry, I don't know the reason. 
I need more info. Can you show me the outputs of EXPLAIN ANALYZE ?


EXPLAIN ANALYZE
SELECT       orders.orderkey   FROM       lineitem LEFT OUTER JOIN       orders USING(orderkey)   WHERE
orders.orderkeyIS NOT NULL   GROUP BY       orders.orderkey   HAVING       SUM(lineitem.quantity) > 300;
 



EXPLAIN ANALYZE
SELECT     t2.*
FROM (SELECT                 orders.orderkey             FROM                  lineitem LEFT OUTER JOIN
orders USING(orderkey)            WHERE                  orders.orderkey IS NOT NULL           GROUP BY
orders.orderkey           HAVING                   SUM(lineitem.quantity) > 300          ) AS t1 LEFT OUTER JOIN
 orders AS t2 USING(orderkey)
 
ORDER BY t2.custkey           


Regards,
Masaru Sugawara




Re: [HACKERS] please help on query

From
"Luis Alberto Amigo Navarro"
Date:
----- Original Message -----
From: "Masaru Sugawara" <rk73@sea.plala.or.jp>
To: "Luis Alberto Amigo Navarro" <lamigo@atc.unican.es>
Cc: <pgsql-sql@postgresql.org>
Sent: Monday, July 15, 2002 6:15 PM
Subject: Re: [SQL] [HACKERS] please help on query



>
> Sorry, I don't know the reason.
> I need more info. Can you show me the outputs of EXPLAIN ANALYZE ?
>
Here it is:


>
> EXPLAIN ANALYZE
> SELECT
>         orders.orderkey
>     FROM
>         lineitem LEFT OUTER JOIN
>         orders USING(orderkey)
>     WHERE
>         orders.orderkey IS NOT NULL
>     GROUP BY
>         orders.orderkey
>     HAVING
>         SUM(lineitem.quantity) > 300;
>
Aggregate  (cost=1257368.92..1287375.04 rows=600122 width=12) (actual
time=1236941.71..1454824.56 rows=62 loops=1) ->  Group  (cost=1257368.92..1272371.98 rows=6001225 width=12) (actual
time=1233968.87..1385034.91 rows=6001225 loops=1)       ->  Sort  (cost=1257368.92..1257368.92 rows=6001225 width=12)
(actual time=1233968.82..1276147.37 rows=6001225 loops=1)             ->  Hash Join  (cost=166395.00..520604.08
rows=6001225
width=12) (actual time=59061.21..773997.08 rows=6001225 loops=1)                   ->  Seq Scan on lineitem
(cost=0.00..195405.25
rows=6001225 width=8) (actual time=20.66..115511.34 rows=6001225 loops=1)                   ->  Hash
(cost=162645.00..162645.00rows=1500000
 
width=4) (actual time=59032.16..59032.16 rows=0 loops=1)                         ->  Seq Scan on orders
(cost=0.00..162645.00
rows=1500000 width=4) (actual time=17.33..44420.10 rows=1500000 loops=1)
Total runtime: 1454929.11 msec




>
>
> EXPLAIN ANALYZE
> SELECT
>       t2.*
> FROM (SELECT
>                   orders.orderkey
>               FROM
>                    lineitem LEFT OUTER JOIN
>                    orders USING(orderkey)
>              WHERE
>                    orders.orderkey IS NOT NULL
>             GROUP BY
>                    orders.orderkey
>             HAVING
>                     SUM(lineitem.quantity) > 300
>            ) AS t1 LEFT OUTER JOIN
>            orders AS t2 USING(orderkey)
> ORDER BY t2.custkey
>

Sort  (cost=1739666.43..1739666.43 rows=600122 width=119) (actual
time=1538897.23..1538897.47 rows=62 loops=1) ->  Merge Join  (cost=1344971.49..1682069.98 rows=600122 width=119)
(actual time=1440886.58..1538886.03 rows=62 loops=1)       ->  Index Scan using orders_pkey on orders t2
(cost=0.00..324346.65
rows=1500000 width=115) (actual time=32.80..87906.98 rows=1455276 loops=1)       ->  Sort  (cost=1344971.49..1344971.49
rows=600122width=12) (actual
 
time=1439550.31..1439550.73 rows=62 loops=1)             ->  Subquery Scan t1  (cost=1257368.92..1287375.04
rows=600122
width=12) (actual time=1222560.86..1439549.36 rows=62 loops=1)                   ->  Aggregate
(cost=1257368.92..1287375.04rows=600122
 
width=12) (actual time=1222560.84..1439548.42 rows=62 loops=1)                         ->  Group
(cost=1257368.92..1272371.98
rows=6001225 width=12) (actual time=1219607.04..1369327.42 rows=6001225
loops=1)                               ->  Sort  (cost=1257368.92..1257368.92
rows=6001225 width=12) (actual time=1219607.00..1261208.08 rows=6001225
loops=1)                                     ->  Hash Join
(cost=166395.00..520604.08 rows=6001225 width=12) (actual
time=65973.31..769253.41 rows=6001225 loops=1)                                           ->  Seq Scan on lineitem
(cost=0.00..195405.25 rows=6001225 width=8) (actual time=20.07..115247.61
rows=6001225 loops=1)                                           ->  Hash
(cost=162645.00..162645.00 rows=1500000 width=4) (actual
time=65943.80..65943.80 rows=0 loops=1)                                                 ->  Seq Scan on orders
(cost=0.00..162645.00 rows=1500000 width=4) (actual time=39.04..52049.90
rows=1500000 loops=1)
Total runtime: 1539010.00 msec



Thanks and regards




Re: [HACKERS] please help on query

From
Masaru Sugawara
Date:
On Tue, 16 Jul 2002 10:51:03 +0200
"Luis Alberto Amigo Navarro" <lamigo@atc.unican.es> wrote:


> Aggregate  (cost=1257368.92..1287375.04 rows=600122 width=12) (actual
> time=1236941.71..1454824.56 rows=62 loops=1)
>   ->  Group  (cost=1257368.92..1272371.98 rows=6001225 width=12) (actual
> time=1233968.87..1385034.91 rows=6001225 loops=1)
>         ->  Sort  (cost=1257368.92..1257368.92 rows=6001225 width=12)
> (actual time=1233968.82..1276147.37 rows=6001225 loops=1)
>               ->  Hash Join  (cost=166395.00..520604.08 rows=6001225
> width=12) (actual time=59061.21..773997.08 rows=6001225 loops=1)
>                     ->  Seq Scan on lineitem  (cost=0.00..195405.25
> rows=6001225 width=8) (actual time=20.66..115511.34 rows=6001225 loops=1)
>                     ->  Hash  (cost=162645.00..162645.00 rows=1500000
> width=4) (actual time=59032.16..59032.16 rows=0 loops=1)
>                           ->  Seq Scan on orders  (cost=0.00..162645.00
> rows=1500000 width=4) (actual time=17.33..44420.10 rows=1500000 loops=1)
> Total runtime: 1454929.11 msec


Hmm, does each of the three tables have some indices like the following?
If not so, could you execute EXPLAIN ANALYZE after creating the indices.


create index idx_lineitem_orderkey on lineitem(orderkey);
create index idx_orders_orderkey on orders(orderkey);
create index idx_orders_custkey on orders(custkey);
create index idx_customer_custkey on customer(custkey);


Regards,
Masaru Sugawara