Thread: Is a better way to have the same result of this query?

Is a better way to have the same result of this query?

From
Vernon Wu
Date:
I have the following query:

SELECT p.userid, p.year, a.country, a.province, a.city
FROM profile p, account a
WHERE p.userid=a.userid AND
    (p.year BETWEEN 1961 AND 1976) AND
    a.country='CA' AND
    a.province='BC' AND
    p.gender='f' AND
    p.userid NOT IN (SELECT b.userid FROM block b WHERE b.personid='Joe') AND
    block.userid IS NOT NULL AND
    p.userid IN
    (SELECT f.userid FROM preference f, profile p1 WHERE p1.userid='Joe' AND 2002-p1.year BETWEEN
    f.minage AND f.maxage)

In plain English, it is that

Joe finds females between the ages in the location who is not in the block table, while Joe's age is between what they
prefer.

The query plan is the followings:

Nested Loop  (cost=0.00..127.12 rows=995 width=894)
  ->  Nested Loop  (cost=0.00..97.17 rows=1 width=894)
        ->  Seq Scan on account a  (cost=0.00..25.00 rows=1 width=289)
        ->  Index Scan using pk_profile on profile p  (cost=0.00..72.16 rows=1 width=605)
              SubPlan
                ->  Materialize  (cost=22.50..22.50 rows=5 width=55)
                      ->  Seq Scan on block b  (cost=0.00..22.50 rows=5 width=55
)
                ->  Materialize  (cost=44.82..44.82 rows=111 width=89)
                      ->  Nested Loop  (cost=0.00..44.82 rows=111 width=89)
                            ->  Index Scan using pk_profile on profile p1  (cost=0.00..4.82 rows=1 width=12)
                            ->  Seq Scan on preference f  (cost=0.00..20.00 rows=1000 width=77)
  ->  Seq Scan on block  (cost=0.00..20.00 rows=995 width=0)

It seems take quite long to run this query. How to optimise the query?

Thanks for your input.

Vernon




Re: Is a better way to have the same result of this

From
Ron Johnson
Date:
On Wed, 2002-12-04 at 18:26, Vernon Wu wrote:
> I have the following query:
>
> SELECT p.userid, p.year, a.country, a.province, a.city
> FROM profile p, account a
> WHERE p.userid=a.userid AND
>     (p.year BETWEEN 1961 AND 1976) AND
>     a.country='CA' AND
>     a.province='BC' AND
>     p.gender='f' AND
>     p.userid NOT IN (SELECT b.userid FROM block b WHERE b.personid='Joe') AND
>     block.userid IS NOT NULL AND
>     p.userid IN
>     (SELECT f.userid FROM preference f, profile p1 WHERE p1.userid='Joe' AND 2002-p1.year BETWEEN
>     f.minage AND f.maxage)
>
> In plain English, it is that
>
> Joe finds females between the ages in the location who is not in the block table, while Joe's age is between what
they 
> prefer.
>
> The query plan is the followings:
>
> Nested Loop  (cost=0.00..127.12 rows=995 width=894)
>   ->  Nested Loop  (cost=0.00..97.17 rows=1 width=894)
>         ->  Seq Scan on account a  (cost=0.00..25.00 rows=1 width=289)
>         ->  Index Scan using pk_profile on profile p  (cost=0.00..72.16 rows=1 width=605)
>               SubPlan
>                 ->  Materialize  (cost=22.50..22.50 rows=5 width=55)
>                       ->  Seq Scan on block b  (cost=0.00..22.50 rows=5 width=55
> )
>                 ->  Materialize  (cost=44.82..44.82 rows=111 width=89)
>                       ->  Nested Loop  (cost=0.00..44.82 rows=111 width=89)
>                             ->  Index Scan using pk_profile on profile p1  (cost=0.00..4.82 rows=1 width=12)
>                             ->  Seq Scan on preference f  (cost=0.00..20.00 rows=1000 width=77)
>   ->  Seq Scan on block  (cost=0.00..20.00 rows=995 width=0)
>
> It seems take quite long to run this query. How to optimise the query?
>
> Thanks for your input.
>
> Vernon

What kind of indexes, if any, do you have on, and what is the
cardinality of account, block and preference?

What version of Postgres are you using?

How much shared memory and buffers are you using?

--
+------------------------------------------------------------+
| Ron Johnson, Jr.     mailto:ron.l.johnson@cox.net          |
| Jefferson, LA  USA   http://members.cox.net/ron.l.johnson  |
|                                                            |
| "they love our milk and honey, but preach about another    |
|  way of living"                                            |
|    Merle Haggard, "The Fighting Side Of Me"                |
+------------------------------------------------------------+


Re: Is a better way to have the same result of this query?

From
Jochem van Dieten
Date:
Vernon Wu wrote:
>
> SELECT p.userid, p.year, a.country, a.province, a.city
> FROM profile p, account a
> WHERE p.userid=a.userid AND
>     (p.year BETWEEN 1961 AND 1976) AND
>     a.country='CA' AND
>     a.province='BC' AND
>     p.gender='f' AND
>     p.userid NOT IN (SELECT b.userid FROM block b WHERE b.personid='Joe') AND
>     block.userid IS NOT NULL AND
>     p.userid IN
>     (SELECT f.userid FROM preference f, profile p1 WHERE p1.userid='Joe' AND 2002-p1.year BETWEEN
>     f.minage AND f.maxage)

You might want to flatten this into more joins and less subqueries,
especially since you are using IN which is not very optimized:

SELECT p.userid, p.year, a.country, a.province, a.city
FROM profile p, account a, preference f, profile p1
WHERE
    f.userid = p.userid AND
    p.userid=a.userid AND
    (p.year BETWEEN 1961 AND 1976) AND
    a.country='CA' AND
    a.province='BC' AND
    p.gender='f' AND
    p.userid NOT IN (SELECT b.userid FROM block b WHERE b.personid='Joe') AND
    block.userid IS NOT NULL AND
    p1.userid='Joe' AND
    2002-p1.year BETWEEN f.minage AND f.maxage

Also, I am not sure about the NOT IN. If you can rewrite it using EXISTS
try that, it might be faster.


> Nested Loop  (cost=0.00..127.12 rows=995 width=894)
>   ->  Nested Loop  (cost=0.00..97.17 rows=1 width=894)
>         ->  Seq Scan on account a  (cost=0.00..25.00 rows=1 width=289)
>         ->  Index Scan using pk_profile on profile p  (cost=0.00..72.16 rows=1 width=605)
>               SubPlan
>                 ->  Materialize  (cost=22.50..22.50 rows=5 width=55)
>                       ->  Seq Scan on block b  (cost=0.00..22.50 rows=5 width=55
> )
>                 ->  Materialize  (cost=44.82..44.82 rows=111 width=89)
>                       ->  Nested Loop  (cost=0.00..44.82 rows=111 width=89)
>                             ->  Index Scan using pk_profile on profile p1  (cost=0.00..4.82 rows=1 width=12)
>                             ->  Seq Scan on preference f  (cost=0.00..20.00 rows=1000 width=77)

rows=1000 usually indicates you didn't vacuum analyze. Did you?

>   ->  Seq Scan on block  (cost=0.00..20.00 rows=995 width=0)

And to add to Vernons questions: if you are using PostgreSQL 7.2 or
later, please send us the EXPLAIN ANALYZE output.

Jochem


Re: Is a better way to have the same result of this query?

From
Vernon Wu
Date:
Jochem,

Thanks for your suggestion/information.

The followings are the analyise outcomes after I did some modifications with the query. My finding is that the new
query 
does improve the performance according to the plan. The actual time is reversed might due to the fact the test data is
very small (the machine is a very old one by the way). The userid is the key for all tables and the gender is indexed.
DoI  
also index the country and province to improve the preformance?

The modified query with the suggested flatting query.

Nested Loop  (cost=0.00..91.97 rows=995 width=445) (actual time=1.00..3.00 rows=2 loops=1)
  ->  Nested Loop  (cost=0.00..62.02 rows=1 width=445) (actual time=1.00..3.00 rows=2 loops=1)
        ->  Nested Loop  (cost=0.00..34.68 rows=1 width=378) (actual time=1.00..3.00 rows=3 loops=1)
              ->  Nested Loop  (cost=0.00..29.84 rows=1 width=366) (actual time=1.00..3.00 rows=3 loops=1)
                    ->  Seq Scan on account a  (cost=0.00..25.00 rows=1 width=289) (actual time=0.00..0.00 rows=3
loops=1)
                    ->  Index Scan using pk_preference on preference f  (cost=0.00..4.82 rows=1 width=77) (actual time=
0.67..1.00 rows=1 loops=3)
              ->  Index Scan using pk_profile on profile p1  (cost=0.00..4.82 rows=1 width=12) (actual time=0.00..0.00
rows=
1 loops=3)
        ->  Index Scan using pk_profile on profile p  (cost=0.00..27.33 rows=1 width=67) (actual time=0.00..0.00 rows=1

loops=3)
              SubPlan
                ->  Materialize  (cost=22.50..22.50 rows=5 width=55) (actual time=0.00..0.00 rows=0 loops=2)
                      ->  Seq Scan on block b  (cost=0.00..22.50 rows=5 width=55) (actual time=0.00..0.00 rows=0
loops=1)
  ->  Seq Scan on block  (cost=0.00..20.00 rows=995 width=0) (actual time=0.00..0.00 rows=1 loops=2)
Total runtime: 5.00 msec

After replacing "p.userid NOT IN" with "NOT EXISTS":

Result  (cost=0.00..61.56 rows=995 width=445) (actual time=3.00..4.00 rows=2 loops=1)
  InitPlan
    ->  Seq Scan on block b  (cost=0.00..22.50 rows=5 width=55) (actual time=0.00..0.00 rows=0 loops=1)
  ->  Nested Loop  (cost=0.00..61.56 rows=995 width=445) (actual time=3.00..4.00 rows=2 loops=1)
        ->  Nested Loop  (cost=0.00..31.61 rows=1 width=445) (actual time=3.00..4.00 rows=2 loops=1)
              ->  Nested Loop  (cost=0.00..26.77 rows=1 width=433) (actual time=2.00..3.00 rows=2 loops=1)
                    ->  Nested Loop  (cost=0.00..21.93 rows=1 width=356) (actual time=2.00..2.00 rows=2 loops=1)
                          ->  Index Scan using profile_sex_idx on profile p  (cost=0.00..17.09 rows=1 width=67) (actual
time=
1.00..1.00 rows=2 loops=1)
                          ->  Index Scan using pk_account on account a  (cost=0.00..4.83 rows=1 width=289) (actual
time=
0.50..0.50 rows=1 loops=2)
                    ->  Index Scan using pk_preference on preference f  (cost=0.00..4.82 rows=1 width=77) (actual time=
0.50..0.50 rows=1 loops=2)
              ->  Index Scan using pk_profile on profile p1  (cost=0.00..4.82 rows=1 width=12) (actual time=0.50..0.50
rows=
1 loops=2)
        ->  Seq Scan on block  (cost=0.00..20.00 rows=995 width=0) (actual time=0.00..0.00 rows=1 loops=2)
Total runtime: 7.00 msec

After vacuum analyze:

Result  (cost=3.19..5.29 rows=1 width=91) (actual time=3.00..4.00 rows=2 loops=1)
  InitPlan
    ->  Seq Scan on block b  (cost=0.00..1.01 rows=1 width=7) (actual time=0.00..0.00 rows=0 loops=1)
  ->  Nested Loop  (cost=3.19..5.29 rows=1 width=91) (actual time=3.00..4.00 rows=2 loops=1)
        ->  Hash Join  (cost=3.19..4.27 rows=1 width=91) (actual time=3.00..3.00 rows=2 loops=1)
              ->  Hash Join  (cost=2.13..3.20 rows=1 width=72) (actual time=2.00..2.00 rows=3 loops=1)
                    ->  Seq Scan on account a  (cost=0.00..1.04 rows=3 width=31) (actual time=1.00..1.00 rows=3
loops=1)
                    ->  Hash  (cost=2.13..2.13 rows=1 width=41) (actual time=0.00..0.00 rows=0 loops=1)
                          ->  Nested Loop  (cost=0.00..2.13 rows=1 width=41) (actual time=0.00..0.00 rows=3 loops=1)
                                ->  Seq Scan on profile p1  (cost=0.00..1.04 rows=1 width=12) (actual time=0.00..0.00
rows=1 
loops=1)
                                ->  Seq Scan on preference f  (cost=0.00..1.03 rows=3 width=29) (actual time=0.00..0.00
rows=3 
loops=1)
              ->  Hash  (cost=1.05..1.05 rows=2 width=19) (actual time=1.00..1.00 rows=0 loops=1)
                    ->  Seq Scan on profile p  (cost=0.00..1.05 rows=2 width=19) (actual time=1.00..1.00 rows=2
loops=1)
        ->  Seq Scan on block  (cost=0.00..1.01 rows=1 width=0) (actual time=0.00..0.00 rows=1 loops=2)
Total runtime: 7.00 msec

The original query

Nested Loop  (cost=0.00..127.12 rows=995 width=894) (actual time=1.00..2.00 rows=2 loops=1)
  ->  Nested Loop  (cost=0.00..97.17 rows=1 width=894) (actual time=1.00..1.00 rows=2 loops=1)
        ->  Seq Scan on account a  (cost=0.00..25.00 rows=1 width=289) (actual time=0.00..0.00 rows=3 loops=1)
        ->  Index Scan using pk_profile on profile p  (cost=0.00..72.16 rows=1 width=605) (actual time=0.33..0.33
rows=1 
loops=3)
              SubPlan
                ->  Materialize  (cost=22.50..22.50 rows=5 width=55) (actual time=0.00..0.00 rows=0 loops=2)
                      ->  Seq Scan on block b  (cost=0.00..22.50 rows=5 width=55) (actual time=0.00..0.00 rows=0
loops=1)
                ->  Materialize  (cost=44.82..44.82 rows=111 width=89) (actual time=0.50..0.50 rows=1 loops=2)
                      ->  Nested Loop  (cost=0.00..44.82 rows=111 width=89) (actual time=0.00..0.00 rows=3 loops=1)
                            ->  Index Scan using pk_profile on profile p1  (cost=0.00..4.82 rows=1 width=12) (actual
time=
0.00..0.00 rows=1 loops=1)
                            ->  Seq Scan on preference f  (cost=0.00..20.00 rows=1000 width=77) (actual time=0.00..0.00
rows=
3 loops=1)
  ->  Seq Scan on block  (cost=0.00..20.00 rows=995 width=0) (actual time=0.00..0.00 rows=1 loops=2)
Total runtime: 4.00 msec

After replacing "p.userid NOT IN" with "NOT EXISTS":

Result  (cost=0.00..104.62 rows=995 width=894) (actual time=1.00..2.00 rows=2 loops=1)
  InitPlan
    ->  Seq Scan on block b  (cost=0.00..22.50 rows=5 width=55) (actual time=0.00..0.00 rows=0 loops=1)
  ->  Nested Loop  (cost=0.00..104.62 rows=995 width=894) (actual time=1.00..1.00 rows=2 loops=1)
        ->  Nested Loop  (cost=0.00..74.67 rows=1 width=894) (actual time=1.00..1.00 rows=2 loops=1)
              ->  Seq Scan on account a  (cost=0.00..25.00 rows=1 width=289) (actual time=0.00..0.00 rows=3 loops=1)
              ->  Index Scan using pk_profile on profile p  (cost=0.00..49.66 rows=1 width=605) (actual time=0.33..0.33

rows=1 loops=3)
                    SubPlan
                      ->  Materialize  (cost=44.82..44.82 rows=111 width=89) (actual time=0.50..0.50 rows=1 loops=2)
                            ->  Nested Loop  (cost=0.00..44.82 rows=111 width=89) (actual time=0.00..1.00 rows=3
loops=1)
                                  ->  Index Scan using pk_profile on profile p1 (cost=0.00..4.82 rows=1 width=12)
(actualtime= 
0.00..0.00 rows=1 loops=1)
                                  ->  Seq Scan on preference f  (cost=0.00..20.00 rows=1000 width=77) (actual
time=0.00..1.00 
rows=3 loops=1)
        ->  Seq Scan on block  (cost=0.00..20.00 rows=995 width=0) (actual time=0.00..0.00 rows=1 loops=2)
Total runtime: 4.00 msec

After vacuum analyze:

Result  (cost=7.30..9.39 rows=1 width=63) (actual time=3.00..3.00 rows=2 loops=1)
  InitPlan
    ->  Seq Scan on block b  (cost=0.00..1.01 rows=1 width=7) (actual time=0.00..0.00 rows=0 loops=1)
  ->  Nested Loop  (cost=7.30..9.39 rows=1 width=63) (actual time=3.00..3.00 rows=2 loops=1)
        ->  Seq Scan on block  (cost=0.00..1.01 rows=1 width=0) (actual time=0.00..0.00 rows=1 loops=1)
        ->  Materialize  (cost=8.37..8.37 rows=1 width=63) (actual time=3.00..3.00 rows=2 loops=1)
              ->  Hash Join  (cost=7.30..8.37 rows=1 width=63) (actual time=2.00..3.00 rows=2 loops=1)
                    ->  Seq Scan on account a  (cost=0.00..1.04 rows=3 width=31) (actual time=0.00..0.00 rows=3
loops=1)
                    ->  Hash  (cost=7.30..7.30 rows=1 width=32) (actual time=2.00..2.00 rows=0 loops=1)
                          ->  Index Scan using profile_sex_idx on profile p  (cost=0.00..7.30 rows=1 width=32) (actual
time=
2.00..2.00 rows=2 loops=1)
                                SubPlan
                                  ->  Materialize  (cost=2.13..2.13 rows=1 width=41) (actual time=0.50..0.50 rows=1
loops=2)
                                        ->  Nested Loop  (cost=0.00..2.13 rows=1 width=41) (actual time=1.00..1.00
rows=3loops=1) 
                                              ->  Seq Scan on profile p1  (cost=0.00..1.04 rows=1 width=12) (actual
time=0.00..0.00 
rows=1 loops=1)
                                              ->  Seq Scan on preference f  (cost=0.00..1.03 rows=3 width=29) (actual
time=0.00..0.00 
rows=3 loops=1)
Total runtime: 4.00 msec


12/5/2002 2:01:15 AM, Jochem van Dieten <jochemd@oli.tudelft.nl> wrote:

>Vernon Wu wrote:
>>
>> SELECT p.userid, p.year, a.country, a.province, a.city
>> FROM profile p, account a
>> WHERE p.userid=a.userid AND
>>     (p.year BETWEEN 1961 AND 1976) AND
>>     a.country='CA' AND
>>     a.province='BC' AND
>>     p.gender='f' AND
>>     p.userid NOT IN (SELECT b.userid FROM block b WHERE b.personid='Joe') AND
>>     block.userid IS NOT NULL AND
>>     p.userid IN
>>     (SELECT f.userid FROM preference f, profile p1 WHERE p1.userid='Joe' AND 2002-p1.year BETWEEN
>>     f.minage AND f.maxage)
>
>You might want to flatten this into more joins and less subqueries,
>especially since you are using IN which is not very optimized:
>
>SELECT p.userid, p.year, a.country, a.province, a.city
>FROM profile p, account a, preference f, profile p1
>WHERE
>    f.userid = p.userid AND
>    p.userid=a.userid AND
>    (p.year BETWEEN 1961 AND 1976) AND
>    a.country='CA' AND
>    a.province='BC' AND
>    p.gender='f' AND
>    p.userid NOT IN (SELECT b.userid FROM block b WHERE b.personid='Joe') AND
>    block.userid IS NOT NULL AND
>    p1.userid='Joe' AND
>    2002-p1.year BETWEEN f.minage AND f.maxage
>
>Also, I am not sure about the NOT IN. If you can rewrite it using EXISTS
>try that, it might be faster.
>
>
>> Nested Loop  (cost=0.00..127.12 rows=995 width=894)
>>   ->  Nested Loop  (cost=0.00..97.17 rows=1 width=894)
>>         ->  Seq Scan on account a  (cost=0.00..25.00 rows=1 width=289)
>>         ->  Index Scan using pk_profile on profile p  (cost=0.00..72.16 rows=1 width=605)
>>               SubPlan
>>                 ->  Materialize  (cost=22.50..22.50 rows=5 width=55)
>>                       ->  Seq Scan on block b  (cost=0.00..22.50 rows=5 width=55
>> )
>>                 ->  Materialize  (cost=44.82..44.82 rows=111 width=89)
>>                       ->  Nested Loop  (cost=0.00..44.82 rows=111 width=89)
>>                             ->  Index Scan using pk_profile on profile p1  (cost=0.00..4.82 rows=1 width=12)
>>                             ->  Seq Scan on preference f  (cost=0.00..20.00 rows=1000 width=77)
>
>rows=1000 usually indicates you didn't vacuum analyze. Did you?
>
>>   ->  Seq Scan on block  (cost=0.00..20.00 rows=995 width=0)
>
>And to add to Vernons questions: if you are using PostgreSQL 7.2 or
>later, please send us the EXPLAIN ANALYZE output.
>
>Jochem
>
>




Re: Is a better way to have the same result of this

From
Vernon Wu
Date:
Ron,

The gender is indexed. Each user has account and preference, but not necessary block.

I am currently seeking for query optimisation, not system configuration optimisation

12/4/2002 9:26:48 PM, Ron Johnson <ron.l.johnson@cox.net> wrote:

>On Wed, 2002-12-04 at 18:26, Vernon Wu wrote:
>> I have the following query:
>>
>> SELECT p.userid, p.year, a.country, a.province, a.city
>> FROM profile p, account a
>> WHERE p.userid=a.userid AND
>>     (p.year BETWEEN 1961 AND 1976) AND
>>     a.country='CA' AND
>>     a.province='BC' AND
>>     p.gender='f' AND
>>     p.userid NOT IN (SELECT b.userid FROM block b WHERE b.personid='Joe') AND
>>     block.userid IS NOT NULL AND
>>     p.userid IN
>>     (SELECT f.userid FROM preference f, profile p1 WHERE p1.userid='Joe' AND 2002-p1.year BETWEEN
>>     f.minage AND f.maxage)
>>
>> In plain English, it is that
>>
>> Joe finds females between the ages in the location who is not in the block table, while Joe's age is between what
they
>> prefer.
>>
>> The query plan is the followings:
>>
>> Nested Loop  (cost=0.00..127.12 rows=995 width=894)
>>   ->  Nested Loop  (cost=0.00..97.17 rows=1 width=894)
>>         ->  Seq Scan on account a  (cost=0.00..25.00 rows=1 width=289)
>>         ->  Index Scan using pk_profile on profile p  (cost=0.00..72.16 rows=1 width=605)
>>               SubPlan
>>                 ->  Materialize  (cost=22.50..22.50 rows=5 width=55)
>>                       ->  Seq Scan on block b  (cost=0.00..22.50 rows=5 width=55
>> )
>>                 ->  Materialize  (cost=44.82..44.82 rows=111 width=89)
>>                       ->  Nested Loop  (cost=0.00..44.82 rows=111 width=89)
>>                             ->  Index Scan using pk_profile on profile p1  (cost=0.00..4.82 rows=1 width=12)
>>                             ->  Seq Scan on preference f  (cost=0.00..20.00 rows=1000 width=77)
>>   ->  Seq Scan on block  (cost=0.00..20.00 rows=995 width=0)
>>
>> It seems take quite long to run this query. How to optimise the query?
>>
>> Thanks for your input.
>>
>> Vernon
>
>What kind of indexes, if any, do you have on, and what is the
>cardinality of account, block and preference?
>
>What version of Postgres are you using?
>
>How much shared memory and buffers are you using?
>
>--
>+------------------------------------------------------------+
>| Ron Johnson, Jr.     mailto:ron.l.johnson@cox.net          |
>| Jefferson, LA  USA   http://members.cox.net/ron.l.johnson  |
>|                                                            |
>| "they love our milk and honey, but preach about another    |
>|  way of living"                                            |
>|    Merle Haggard, "The Fighting Side Of Me"                |
>+------------------------------------------------------------+
>
>
>---------------------------(end of broadcast)---------------------------
>TIP 4: Don't 'kill -9' the postmaster
>




Re: Is a better way to have the same result of this

From
Tomasz Myrta
Date:
Vernon Wu wrote:

 > Ron,
 >
 > The gender is indexed. Each user has account and preference, but not
necessary block.


Indexing on gender won't speed up your query - it can even slow it down.
You have probably 50% of "f" and 50% of "m". Using index on gender will
divide your potential answers by 2. Make index on columns, which
excludes much more useless rows.
I think you can create index on:
- block/personid
- profile/userid

I read in Postgres documentation(but didn't try) that you can also
change "id NOT IN (select id" to "not exists select * where id=". It may
help also.

Do user have more than one account or preference?
If no, you can change "not in" into "inner/outer join" which are the
best ones.

Regards,
Tomasz Myrta


Re: Is a better way to have the same result of this

From
Jean-Luc Lachance
Date:
It is now common knowledge that the IN clause should be rewriten as an
EXISTS.

SELECT p.userid, p.year, a.country, a.province, a.city
FROM profile p, account a
WHERE p.userid=a.userid AND
  (p.year BETWEEN 1961 AND 1976) AND
  a.country='CA' AND
  a.province='BC' AND
  p.gender='f' AND
  NOT EXISTS ( SELECT 1 FROM block b WHERE b.personid='Joe' AND p.userid
= b.userid) AND
  block.userid IS NOT NULL AND
  EXISTS ( SELECT 1 FROM preference f, profile p1
    WHERE p1.userid='Joe' AND p.userid = f.userif AND
      2002-p1.year BETWEEN f.minage AND f.maxage);



Vernon Wu wrote:
>
> Ron,
>
> The gender is indexed. Each user has account and preference, but not necessary block.
>
> I am currently seeking for query optimisation, not system configuration optimisation
>
> 12/4/2002 9:26:48 PM, Ron Johnson <ron.l.johnson@cox.net> wrote:
>
> >On Wed, 2002-12-04 at 18:26, Vernon Wu wrote:
> >> I have the following query:
> >>
> >> SELECT p.userid, p.year, a.country, a.province, a.city
> >> FROM profile p, account a
> >> WHERE p.userid=a.userid AND
> >>      (p.year BETWEEN 1961 AND 1976) AND
> >>      a.country='CA' AND
> >>      a.province='BC' AND
> >>      p.gender='f' AND
> >>      p.userid NOT IN (SELECT b.userid FROM block b WHERE b.personid='Joe') AND
> >>      block.userid IS NOT NULL AND
> >>      p.userid IN
> >>      (SELECT f.userid FROM preference f, profile p1 WHERE p1.userid='Joe' AND 2002-p1.year BETWEEN
> >>      f.minage AND f.maxage)
> >>
> >> In plain English, it is that
> >>
> >> Joe finds females between the ages in the location who is not in the block table, while Joe's age is between what
> they
> >> prefer.
> >>
> >> The query plan is the followings:
> >>
> >> Nested Loop  (cost=0.00..127.12 rows=995 width=894)
> >>   ->  Nested Loop  (cost=0.00..97.17 rows=1 width=894)
> >>         ->  Seq Scan on account a  (cost=0.00..25.00 rows=1 width=289)
> >>         ->  Index Scan using pk_profile on profile p  (cost=0.00..72.16 rows=1 width=605)
> >>               SubPlan
> >>                 ->  Materialize  (cost=22.50..22.50 rows=5 width=55)
> >>                       ->  Seq Scan on block b  (cost=0.00..22.50 rows=5 width=55
> >> )
> >>                 ->  Materialize  (cost=44.82..44.82 rows=111 width=89)
> >>                       ->  Nested Loop  (cost=0.00..44.82 rows=111 width=89)
> >>                             ->  Index Scan using pk_profile on profile p1  (cost=0.00..4.82 rows=1 width=12)
> >>                             ->  Seq Scan on preference f  (cost=0.00..20.00 rows=1000 width=77)
> >>   ->  Seq Scan on block  (cost=0.00..20.00 rows=995 width=0)
> >>

Re: Is a better way to have the same result of this

From
Andrew Sullivan
Date:
On Thu, Dec 05, 2002 at 11:08:17AM -0800, Vernon Wu wrote:
> Ron,
>
> The gender is indexed.

Given that gender only has two (?  Very few, anyway) values, I can't
believe an index will be much use: it's not very selective.  Maybe
combining several columns in one index will help you.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110


Re: Is a better way to have the same result of this

From
Vernon Wu
Date:
I just learnt the "common knowledge" about four hourse ago. That does help to improve the performance indeed
according to the explain command.

12/5/2002 11:45:26 AM, Jean-Luc Lachance <jllachan@nsd.ca> wrote:

>It is now common knowledge that the IN clause should be rewriten as an
>EXISTS.
>
>SELECT p.userid, p.year, a.country, a.province, a.city
>FROM profile p, account a
>WHERE p.userid=a.userid AND
>  (p.year BETWEEN 1961 AND 1976) AND
>  a.country='CA' AND
>  a.province='BC' AND
>  p.gender='f' AND
>  NOT EXISTS ( SELECT 1 FROM block b WHERE b.personid='Joe' AND p.userid
>= b.userid) AND
>  block.userid IS NOT NULL AND
>  EXISTS ( SELECT 1 FROM preference f, profile p1
>    WHERE p1.userid='Joe' AND p.userid = f.userif AND
>      2002-p1.year BETWEEN f.minage AND f.maxage);
>
>
>
>Vernon Wu wrote:
>>
>> Ron,
>>
>> The gender is indexed. Each user has account and preference, but not necessary block.
>>
>> I am currently seeking for query optimisation, not system configuration optimisation
>>
>> 12/4/2002 9:26:48 PM, Ron Johnson <ron.l.johnson@cox.net> wrote:
>>
>> >On Wed, 2002-12-04 at 18:26, Vernon Wu wrote:
>> >> I have the following query:
>> >>
>> >> SELECT p.userid, p.year, a.country, a.province, a.city
>> >> FROM profile p, account a
>> >> WHERE p.userid=a.userid AND
>> >>      (p.year BETWEEN 1961 AND 1976) AND
>> >>      a.country='CA' AND
>> >>      a.province='BC' AND
>> >>      p.gender='f' AND
>> >>      p.userid NOT IN (SELECT b.userid FROM block b WHERE b.personid='Joe') AND
>> >>      block.userid IS NOT NULL AND
>> >>      p.userid IN
>> >>      (SELECT f.userid FROM preference f, profile p1 WHERE p1.userid='Joe' AND 2002-p1.year BETWEEN
>> >>      f.minage AND f.maxage)
>> >>
>> >> In plain English, it is that
>> >>
>> >> Joe finds females between the ages in the location who is not in the block table, while Joe's age is between what
>> they
>> >> prefer.
>> >>
>> >> The query plan is the followings:
>> >>
>> >> Nested Loop  (cost=0.00..127.12 rows=995 width=894)
>> >>   ->  Nested Loop  (cost=0.00..97.17 rows=1 width=894)
>> >>         ->  Seq Scan on account a  (cost=0.00..25.00 rows=1 width=289)
>> >>         ->  Index Scan using pk_profile on profile p  (cost=0.00..72.16 rows=1 width=605)
>> >>               SubPlan
>> >>                 ->  Materialize  (cost=22.50..22.50 rows=5 width=55)
>> >>                       ->  Seq Scan on block b  (cost=0.00..22.50 rows=5 width=55
>> >> )
>> >>                 ->  Materialize  (cost=44.82..44.82 rows=111 width=89)
>> >>                       ->  Nested Loop  (cost=0.00..44.82 rows=111 width=89)
>> >>                             ->  Index Scan using pk_profile on profile p1  (cost=0.00..4.82 rows=1 width=12)
>> >>                             ->  Seq Scan on preference f  (cost=0.00..20.00 rows=1000 width=77)
>> >>   ->  Seq Scan on block  (cost=0.00..20.00 rows=995 width=0)
>> >>
>
>---------------------------(end of broadcast)---------------------------
>TIP 5: Have you checked our extensive FAQ?
>
>http://www.postgresql.org/users-lounge/docs/faq.html
>




Re: Is a better way to have the same result of this

From
Jean-Luc Lachance
Date:
Good for you!  Too bad the parser does not know about it...



Vernon Wu wrote:
>
> I just learnt the "common knowledge" about four hourse ago. That does help to improve the performance indeed
> according to the explain command.
>
> 12/5/2002 11:45:26 AM, Jean-Luc Lachance <jllachan@nsd.ca> wrote:
>
> >It is now common knowledge that the IN clause should be rewriten as an
> >EXISTS.
> >
> >SELECT p.userid, p.year, a.country, a.province, a.city
> >FROM profile p, account a
> >WHERE p.userid=a.userid AND
> >  (p.year BETWEEN 1961 AND 1976) AND
> >  a.country='CA' AND
> >  a.province='BC' AND
> >  p.gender='f' AND
> >  NOT EXISTS ( SELECT 1 FROM block b WHERE b.personid='Joe' AND p.userid
> >= b.userid) AND
> >  block.userid IS NOT NULL AND
> >  EXISTS ( SELECT 1 FROM preference f, profile p1
> >    WHERE p1.userid='Joe' AND p.userid = f.userif AND
> >      2002-p1.year BETWEEN f.minage AND f.maxage);
> >
> >
> >
> >Vernon Wu wrote:
> >>
> >> Ron,
> >>
> >> The gender is indexed. Each user has account and preference, but not necessary block.
> >>
> >> I am currently seeking for query optimisation, not system configuration optimisation
> >>
> >> 12/4/2002 9:26:48 PM, Ron Johnson <ron.l.johnson@cox.net> wrote:
> >>
> >> >On Wed, 2002-12-04 at 18:26, Vernon Wu wrote:
> >> >> I have the following query:
> >> >>
> >> >> SELECT p.userid, p.year, a.country, a.province, a.city
> >> >> FROM profile p, account a
> >> >> WHERE p.userid=a.userid AND
> >> >>      (p.year BETWEEN 1961 AND 1976) AND
> >> >>      a.country='CA' AND
> >> >>      a.province='BC' AND
> >> >>      p.gender='f' AND
> >> >>      p.userid NOT IN (SELECT b.userid FROM block b WHERE b.personid='Joe') AND
> >> >>      block.userid IS NOT NULL AND
> >> >>      p.userid IN
> >> >>      (SELECT f.userid FROM preference f, profile p1 WHERE p1.userid='Joe' AND 2002-p1.year BETWEEN
> >> >>      f.minage AND f.maxage)
> >> >>
> >> >> In plain English, it is that
> >> >>
> >> >> Joe finds females between the ages in the location who is not in the block table, while Joe's age is between
what
> >> they
> >> >> prefer.
> >> >>
> >> >> The query plan is the followings:
> >> >>
> >> >> Nested Loop  (cost=0.00..127.12 rows=995 width=894)
> >> >>   ->  Nested Loop  (cost=0.00..97.17 rows=1 width=894)
> >> >>         ->  Seq Scan on account a  (cost=0.00..25.00 rows=1 width=289)
> >> >>         ->  Index Scan using pk_profile on profile p  (cost=0.00..72.16 rows=1 width=605)
> >> >>               SubPlan
> >> >>                 ->  Materialize  (cost=22.50..22.50 rows=5 width=55)
> >> >>                       ->  Seq Scan on block b  (cost=0.00..22.50 rows=5 width=55
> >> >> )
> >> >>                 ->  Materialize  (cost=44.82..44.82 rows=111 width=89)
> >> >>                       ->  Nested Loop  (cost=0.00..44.82 rows=111 width=89)
> >> >>                             ->  Index Scan using pk_profile on profile p1  (cost=0.00..4.82 rows=1 width=12)
> >> >>                             ->  Seq Scan on preference f  (cost=0.00..20.00 rows=1000 width=77)
> >> >>   ->  Seq Scan on block  (cost=0.00..20.00 rows=995 width=0)
> >> >>
> >
> >---------------------------(end of broadcast)---------------------------
> >TIP 5: Have you checked our extensive FAQ?
> >
> >http://www.postgresql.org/users-lounge/docs/faq.html
> >

Re: Is a better way to have the same result of this query?

From
Jochem van Dieten
Date:
Vernon Wu wrote:
>
> The followings are the analyise outcomes after I did some modifications with the query. My finding is that the new
query 
> does improve the performance according to the plan. The actual time is reversed might due to the fact the test data
is 
> very small (the machine is a very old one by the way). The userid is the key for all tables and the gender is
indexed.Do I  
> also index the country and province to improve the preformance?

You start by using a dataset of realistic size. Sorry, but if the actual
time is < 10.00 ms it is rather pointless to optimize further since
chance is going to be the biggest factor. And the IN/EXISTS difference
is dependent on dataset size.

Jochem


Re: Is a better way to have the same result of this query?

From
Vernon Wu
Date:

12/5/2002 1:06:03 PM, Jochem van Dieten <jochemd@oli.tudelft.nl> wrote:

>You start by using a dataset of realistic size. Sorry, but if the actual
>time is < 10.00 ms it is rather pointless to optimize further since
>chance is going to be the biggest factor. And the IN/EXISTS difference
>is dependent on dataset size.
>

Do you mean that using "EXIST" is not necessary out-perform using 'IN" even the "explain" say so? What is the right
size for those two key words?

Thanks for your very hepful information.

Vernon



Re: Is a better way to have the same result of this query?

From
Jochem van Dieten
Date:
Vernon Wu wrote:
> 12/5/2002 1:06:03 PM, Jochem van Dieten <jochemd@oli.tudelft.nl> wrote:
>
>>You start by using a dataset of realistic size. Sorry, but if the actual
>>time is < 10.00 ms it is rather pointless to optimize further since
>>chance is going to be the biggest factor. And the IN/EXISTS difference
>>is dependent on dataset size.
>
> Do you mean that using "EXIST" is not necessary out-perform using 'IN" even the "explain" say so? What is the right
> size for those two key words?

IIRC, IN might be faster on small datasets, but EXISTS is faster on big
ones. So you have to optimize with a dataset that resembles the actual
dataset you will be using in production as close as possible. I don't
know what the size is at which one gets faster as the other.

Jochem


Re: Is a better way to have the same result of this

From
Vernon Wu
Date:
Andrew,

Following your suggestion, I have combined the year field with the gender to create a multicolumn index.  That shall be

better than indexing gender alone. I also create a multicolumn index (country, province, city) for the account table.

Would you suggest indexing all possible fields such as ethnicity, religion    , education, employment in the profile
table;or  
based on what queries I run, to have some multicolumn indexes?

BTW, do you get a lot of snow in Toronto these few days?

Veronon

12/5/2002 11:57:50 AM, Andrew Sullivan <andrew@libertyrms.info> wrote:

>On Thu, Dec 05, 2002 at 11:08:17AM -0800, Vernon Wu wrote:
>> Ron,
>>
>> The gender is indexed.
>
>Given that gender only has two (?  Very few, anyway) values, I can't
>believe an index will be much use: it's not very selective.  Maybe
>combining several columns in one index will help you.
>
>A
>
>--
>----
>Andrew Sullivan                         204-4141 Yonge Street
>Liberty RMS                           Toronto, Ontario Canada
><andrew@libertyrms.info>                              M2P 2A8
>                                         +1 416 646 3304 x110
>
>
>---------------------------(end of broadcast)---------------------------
>TIP 2: you can get off all lists at once with the unregister command
>    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
>




Re: Is a better way to have the same result of this

From
Andrew Sullivan
Date:
On Thu, Dec 05, 2002 at 04:04:28PM -0500, Jean-Luc Lachance wrote:
> Good for you!  Too bad the parser does not know about it...

LAst I heard, there was a problem about providing a rigourous
mathematical proof that NOT EXISTS and NOT IN are really the same.
If you can prove it, I'm sure people would be pleased.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110


Re: Is a better way to have the same result of this

From
Andrew Sullivan
Date:
On Thu, Dec 05, 2002 at 03:19:29PM -0800, Vernon Wu wrote:
> Andrew,
>
> Would you suggest indexing all possible fields such as ethnicity,
> religion , education, employment in the profile table; or based on
> what queries I run, to have some multicolumn indexes?

Never index anything more than you need.  There is a fairly serious
penalty at insertion time for indexes, so you can reduce some
overhead that way.  Note, too, that index space is not recaptured by
Postgres's VACUUM, which imposes a small performance cost, but can be
a real disk-gobbler if you're not careful.

> BTW, do you get a lot of snow in Toronto these few days?

We had some a few weeks ago.  It's pretty clear right now.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110


Re: Is a better way to have the same result of this

From
"scott.marlowe"
Date:
On Thu, 5 Dec 2002, Vernon Wu wrote:

> Andrew,
>
> Following your suggestion, I have combined the year field with the gender to create a multicolumn index.  That shall
be 
> better than indexing gender alone. I also create a multicolumn index (country, province, city) for the account table.

>
> Would you suggest indexing all possible fields such as ethnicity, religion    , education, employment in the profile
table;or  
> based on what queries I run, to have some multicolumn indexes?
>
> BTW, do you get a lot of snow in Toronto these few days?

Vernon, just so you know, for multi-column indexes to be useful in
Postgresql, the columns need to be used in the same order they are
declared in the index if you are using them for an order by.

select * from table order by sex, age;

could use the index

create column table_sex_age on table (sex,age);

but would not use the index

create column table_age_sex on table (age,sex);

However, the order in a where clause portion doesn't really seem to
matter, so

select * from table where sex='m' and age>=38

and

select * from table where age>=38 and sex='m'

should both be able to use the index.

also, you can use functional indexes, but the arguments in the where
clause need the same basic form to be useful.  So, if you commonly make a
select like this:

select * from table where age>50 and age<=59;

then you could make a functional index like :

create index table_age_50_59 on table (age) where age>50 and age<=59;

However, the query

select * from table where age>50 and age<=58;

Wouldn't use that index, since the age <= part doesn't match up.  It could
possible use a generic index on age though, i.e. one like

create index table_age on table (age);

But that index will be larger than the partial one, and so the planner may
skip using it and use a seq scan instead.  Hard to say until your database
is populated with some representational test data.

Since these indexes will be only a small fraction of the total data, it
will often be advantageous to use them with a query.

After you have a set of test data, then you can start looking at tuning
random page cost and such to make your hardware perform properly for
individual queries.  Well, hope that helps.


Re: Is a better way to have the same result of this

From
Vernon Wu
Date:
12/5/2002 4:18:10 PM, "scott.marlowe" <scott.marlowe@ihs.com> wrote:

>Vernon, just so you know, for multi-column indexes to be useful in
>Postgresql, the columns need to be used in the same order they are
>declared in the index if you are using them for an order by.
>
>select * from table order by sex, age;
>
>could use the index
>
>create column table_sex_age on table (sex,age);
>
>but would not use the index
>
>create column table_age_sex on table (age,sex);
>

I haven't have this case yet, might apply for some queries soon.

>However, the order in a where clause portion doesn't really seem to
>matter, so
>
>select * from table where sex='m' and age>=38
>
>and
>
>select * from table where age>=38 and sex='m'
>
>should both be able to use the index.
>
>also, you can use functional indexes, but the arguments in the where
>clause need the same basic form to be useful.  So, if you commonly make a
>select like this:
>
>select * from table where age>50 and age<=59;
>
>then you could make a functional index like :
>
>create index table_age_50_59 on table (age) where age>50 and age<=59;
>
>However, the query
>
>select * from table where age>50 and age<=58;
>
>Wouldn't use that index, since the age <= part doesn't match up.  It could
>possible use a generic index on age though, i.e. one like
>
>create index table_age on table (age);
>

I didn't know the functional index. Thanks for the eductional information.

>But that index will be larger than the partial one, and so the planner may
>skip using it and use a seq scan instead.  Hard to say until your database
>is populated with some representational test data.
>
>Since these indexes will be only a small fraction of the total data, it
>will often be advantageous to use them with a query.
>
>After you have a set of test data, then you can start looking at tuning
>random page cost and such to make your hardware perform properly for
>individual queries.  Well, hope that helps.
>
>

I will do some fine query tuning in the final test phase. Right now, I want to make sure the table design and queries
are 
on the right track.

That indeed helps.

Thanks,

Vernon