Thread: Is a better way to have the same result of this query?
I have the following query: SELECT p.userid, p.year, a.country, a.province, a.city FROM profile p, account a WHERE p.userid=a.userid AND (p.year BETWEEN 1961 AND 1976) AND a.country='CA' AND a.province='BC' AND p.gender='f' AND p.userid NOT IN (SELECT b.userid FROM block b WHERE b.personid='Joe') AND block.userid IS NOT NULL AND p.userid IN (SELECT f.userid FROM preference f, profile p1 WHERE p1.userid='Joe' AND 2002-p1.year BETWEEN f.minage AND f.maxage) In plain English, it is that Joe finds females between the ages in the location who is not in the block table, while Joe's age is between what they prefer. The query plan is the followings: Nested Loop (cost=0.00..127.12 rows=995 width=894) -> Nested Loop (cost=0.00..97.17 rows=1 width=894) -> Seq Scan on account a (cost=0.00..25.00 rows=1 width=289) -> Index Scan using pk_profile on profile p (cost=0.00..72.16 rows=1 width=605) SubPlan -> Materialize (cost=22.50..22.50 rows=5 width=55) -> Seq Scan on block b (cost=0.00..22.50 rows=5 width=55 ) -> Materialize (cost=44.82..44.82 rows=111 width=89) -> Nested Loop (cost=0.00..44.82 rows=111 width=89) -> Index Scan using pk_profile on profile p1 (cost=0.00..4.82 rows=1 width=12) -> Seq Scan on preference f (cost=0.00..20.00 rows=1000 width=77) -> Seq Scan on block (cost=0.00..20.00 rows=995 width=0) It seems take quite long to run this query. How to optimise the query? Thanks for your input. Vernon
On Wed, 2002-12-04 at 18:26, Vernon Wu wrote: > I have the following query: > > SELECT p.userid, p.year, a.country, a.province, a.city > FROM profile p, account a > WHERE p.userid=a.userid AND > (p.year BETWEEN 1961 AND 1976) AND > a.country='CA' AND > a.province='BC' AND > p.gender='f' AND > p.userid NOT IN (SELECT b.userid FROM block b WHERE b.personid='Joe') AND > block.userid IS NOT NULL AND > p.userid IN > (SELECT f.userid FROM preference f, profile p1 WHERE p1.userid='Joe' AND 2002-p1.year BETWEEN > f.minage AND f.maxage) > > In plain English, it is that > > Joe finds females between the ages in the location who is not in the block table, while Joe's age is between what they > prefer. > > The query plan is the followings: > > Nested Loop (cost=0.00..127.12 rows=995 width=894) > -> Nested Loop (cost=0.00..97.17 rows=1 width=894) > -> Seq Scan on account a (cost=0.00..25.00 rows=1 width=289) > -> Index Scan using pk_profile on profile p (cost=0.00..72.16 rows=1 width=605) > SubPlan > -> Materialize (cost=22.50..22.50 rows=5 width=55) > -> Seq Scan on block b (cost=0.00..22.50 rows=5 width=55 > ) > -> Materialize (cost=44.82..44.82 rows=111 width=89) > -> Nested Loop (cost=0.00..44.82 rows=111 width=89) > -> Index Scan using pk_profile on profile p1 (cost=0.00..4.82 rows=1 width=12) > -> Seq Scan on preference f (cost=0.00..20.00 rows=1000 width=77) > -> Seq Scan on block (cost=0.00..20.00 rows=995 width=0) > > It seems take quite long to run this query. How to optimise the query? > > Thanks for your input. > > Vernon What kind of indexes, if any, do you have on, and what is the cardinality of account, block and preference? What version of Postgres are you using? How much shared memory and buffers are you using? -- +------------------------------------------------------------+ | Ron Johnson, Jr. mailto:ron.l.johnson@cox.net | | Jefferson, LA USA http://members.cox.net/ron.l.johnson | | | | "they love our milk and honey, but preach about another | | way of living" | | Merle Haggard, "The Fighting Side Of Me" | +------------------------------------------------------------+
Vernon Wu wrote: > > SELECT p.userid, p.year, a.country, a.province, a.city > FROM profile p, account a > WHERE p.userid=a.userid AND > (p.year BETWEEN 1961 AND 1976) AND > a.country='CA' AND > a.province='BC' AND > p.gender='f' AND > p.userid NOT IN (SELECT b.userid FROM block b WHERE b.personid='Joe') AND > block.userid IS NOT NULL AND > p.userid IN > (SELECT f.userid FROM preference f, profile p1 WHERE p1.userid='Joe' AND 2002-p1.year BETWEEN > f.minage AND f.maxage) You might want to flatten this into more joins and less subqueries, especially since you are using IN which is not very optimized: SELECT p.userid, p.year, a.country, a.province, a.city FROM profile p, account a, preference f, profile p1 WHERE f.userid = p.userid AND p.userid=a.userid AND (p.year BETWEEN 1961 AND 1976) AND a.country='CA' AND a.province='BC' AND p.gender='f' AND p.userid NOT IN (SELECT b.userid FROM block b WHERE b.personid='Joe') AND block.userid IS NOT NULL AND p1.userid='Joe' AND 2002-p1.year BETWEEN f.minage AND f.maxage Also, I am not sure about the NOT IN. If you can rewrite it using EXISTS try that, it might be faster. > Nested Loop (cost=0.00..127.12 rows=995 width=894) > -> Nested Loop (cost=0.00..97.17 rows=1 width=894) > -> Seq Scan on account a (cost=0.00..25.00 rows=1 width=289) > -> Index Scan using pk_profile on profile p (cost=0.00..72.16 rows=1 width=605) > SubPlan > -> Materialize (cost=22.50..22.50 rows=5 width=55) > -> Seq Scan on block b (cost=0.00..22.50 rows=5 width=55 > ) > -> Materialize (cost=44.82..44.82 rows=111 width=89) > -> Nested Loop (cost=0.00..44.82 rows=111 width=89) > -> Index Scan using pk_profile on profile p1 (cost=0.00..4.82 rows=1 width=12) > -> Seq Scan on preference f (cost=0.00..20.00 rows=1000 width=77) rows=1000 usually indicates you didn't vacuum analyze. Did you? > -> Seq Scan on block (cost=0.00..20.00 rows=995 width=0) And to add to Vernons questions: if you are using PostgreSQL 7.2 or later, please send us the EXPLAIN ANALYZE output. Jochem
Jochem, Thanks for your suggestion/information. The followings are the analyise outcomes after I did some modifications with the query. My finding is that the new query does improve the performance according to the plan. The actual time is reversed might due to the fact the test data is very small (the machine is a very old one by the way). The userid is the key for all tables and the gender is indexed. DoI also index the country and province to improve the preformance? The modified query with the suggested flatting query. Nested Loop (cost=0.00..91.97 rows=995 width=445) (actual time=1.00..3.00 rows=2 loops=1) -> Nested Loop (cost=0.00..62.02 rows=1 width=445) (actual time=1.00..3.00 rows=2 loops=1) -> Nested Loop (cost=0.00..34.68 rows=1 width=378) (actual time=1.00..3.00 rows=3 loops=1) -> Nested Loop (cost=0.00..29.84 rows=1 width=366) (actual time=1.00..3.00 rows=3 loops=1) -> Seq Scan on account a (cost=0.00..25.00 rows=1 width=289) (actual time=0.00..0.00 rows=3 loops=1) -> Index Scan using pk_preference on preference f (cost=0.00..4.82 rows=1 width=77) (actual time= 0.67..1.00 rows=1 loops=3) -> Index Scan using pk_profile on profile p1 (cost=0.00..4.82 rows=1 width=12) (actual time=0.00..0.00 rows= 1 loops=3) -> Index Scan using pk_profile on profile p (cost=0.00..27.33 rows=1 width=67) (actual time=0.00..0.00 rows=1 loops=3) SubPlan -> Materialize (cost=22.50..22.50 rows=5 width=55) (actual time=0.00..0.00 rows=0 loops=2) -> Seq Scan on block b (cost=0.00..22.50 rows=5 width=55) (actual time=0.00..0.00 rows=0 loops=1) -> Seq Scan on block (cost=0.00..20.00 rows=995 width=0) (actual time=0.00..0.00 rows=1 loops=2) Total runtime: 5.00 msec After replacing "p.userid NOT IN" with "NOT EXISTS": Result (cost=0.00..61.56 rows=995 width=445) (actual time=3.00..4.00 rows=2 loops=1) InitPlan -> Seq Scan on block b (cost=0.00..22.50 rows=5 width=55) (actual time=0.00..0.00 rows=0 loops=1) -> Nested Loop (cost=0.00..61.56 rows=995 width=445) (actual time=3.00..4.00 rows=2 loops=1) -> Nested Loop (cost=0.00..31.61 rows=1 width=445) (actual time=3.00..4.00 rows=2 loops=1) -> Nested Loop (cost=0.00..26.77 rows=1 width=433) (actual time=2.00..3.00 rows=2 loops=1) -> Nested Loop (cost=0.00..21.93 rows=1 width=356) (actual time=2.00..2.00 rows=2 loops=1) -> Index Scan using profile_sex_idx on profile p (cost=0.00..17.09 rows=1 width=67) (actual time= 1.00..1.00 rows=2 loops=1) -> Index Scan using pk_account on account a (cost=0.00..4.83 rows=1 width=289) (actual time= 0.50..0.50 rows=1 loops=2) -> Index Scan using pk_preference on preference f (cost=0.00..4.82 rows=1 width=77) (actual time= 0.50..0.50 rows=1 loops=2) -> Index Scan using pk_profile on profile p1 (cost=0.00..4.82 rows=1 width=12) (actual time=0.50..0.50 rows= 1 loops=2) -> Seq Scan on block (cost=0.00..20.00 rows=995 width=0) (actual time=0.00..0.00 rows=1 loops=2) Total runtime: 7.00 msec After vacuum analyze: Result (cost=3.19..5.29 rows=1 width=91) (actual time=3.00..4.00 rows=2 loops=1) InitPlan -> Seq Scan on block b (cost=0.00..1.01 rows=1 width=7) (actual time=0.00..0.00 rows=0 loops=1) -> Nested Loop (cost=3.19..5.29 rows=1 width=91) (actual time=3.00..4.00 rows=2 loops=1) -> Hash Join (cost=3.19..4.27 rows=1 width=91) (actual time=3.00..3.00 rows=2 loops=1) -> Hash Join (cost=2.13..3.20 rows=1 width=72) (actual time=2.00..2.00 rows=3 loops=1) -> Seq Scan on account a (cost=0.00..1.04 rows=3 width=31) (actual time=1.00..1.00 rows=3 loops=1) -> Hash (cost=2.13..2.13 rows=1 width=41) (actual time=0.00..0.00 rows=0 loops=1) -> Nested Loop (cost=0.00..2.13 rows=1 width=41) (actual time=0.00..0.00 rows=3 loops=1) -> Seq Scan on profile p1 (cost=0.00..1.04 rows=1 width=12) (actual time=0.00..0.00 rows=1 loops=1) -> Seq Scan on preference f (cost=0.00..1.03 rows=3 width=29) (actual time=0.00..0.00 rows=3 loops=1) -> Hash (cost=1.05..1.05 rows=2 width=19) (actual time=1.00..1.00 rows=0 loops=1) -> Seq Scan on profile p (cost=0.00..1.05 rows=2 width=19) (actual time=1.00..1.00 rows=2 loops=1) -> Seq Scan on block (cost=0.00..1.01 rows=1 width=0) (actual time=0.00..0.00 rows=1 loops=2) Total runtime: 7.00 msec The original query Nested Loop (cost=0.00..127.12 rows=995 width=894) (actual time=1.00..2.00 rows=2 loops=1) -> Nested Loop (cost=0.00..97.17 rows=1 width=894) (actual time=1.00..1.00 rows=2 loops=1) -> Seq Scan on account a (cost=0.00..25.00 rows=1 width=289) (actual time=0.00..0.00 rows=3 loops=1) -> Index Scan using pk_profile on profile p (cost=0.00..72.16 rows=1 width=605) (actual time=0.33..0.33 rows=1 loops=3) SubPlan -> Materialize (cost=22.50..22.50 rows=5 width=55) (actual time=0.00..0.00 rows=0 loops=2) -> Seq Scan on block b (cost=0.00..22.50 rows=5 width=55) (actual time=0.00..0.00 rows=0 loops=1) -> Materialize (cost=44.82..44.82 rows=111 width=89) (actual time=0.50..0.50 rows=1 loops=2) -> Nested Loop (cost=0.00..44.82 rows=111 width=89) (actual time=0.00..0.00 rows=3 loops=1) -> Index Scan using pk_profile on profile p1 (cost=0.00..4.82 rows=1 width=12) (actual time= 0.00..0.00 rows=1 loops=1) -> Seq Scan on preference f (cost=0.00..20.00 rows=1000 width=77) (actual time=0.00..0.00 rows= 3 loops=1) -> Seq Scan on block (cost=0.00..20.00 rows=995 width=0) (actual time=0.00..0.00 rows=1 loops=2) Total runtime: 4.00 msec After replacing "p.userid NOT IN" with "NOT EXISTS": Result (cost=0.00..104.62 rows=995 width=894) (actual time=1.00..2.00 rows=2 loops=1) InitPlan -> Seq Scan on block b (cost=0.00..22.50 rows=5 width=55) (actual time=0.00..0.00 rows=0 loops=1) -> Nested Loop (cost=0.00..104.62 rows=995 width=894) (actual time=1.00..1.00 rows=2 loops=1) -> Nested Loop (cost=0.00..74.67 rows=1 width=894) (actual time=1.00..1.00 rows=2 loops=1) -> Seq Scan on account a (cost=0.00..25.00 rows=1 width=289) (actual time=0.00..0.00 rows=3 loops=1) -> Index Scan using pk_profile on profile p (cost=0.00..49.66 rows=1 width=605) (actual time=0.33..0.33 rows=1 loops=3) SubPlan -> Materialize (cost=44.82..44.82 rows=111 width=89) (actual time=0.50..0.50 rows=1 loops=2) -> Nested Loop (cost=0.00..44.82 rows=111 width=89) (actual time=0.00..1.00 rows=3 loops=1) -> Index Scan using pk_profile on profile p1 (cost=0.00..4.82 rows=1 width=12) (actualtime= 0.00..0.00 rows=1 loops=1) -> Seq Scan on preference f (cost=0.00..20.00 rows=1000 width=77) (actual time=0.00..1.00 rows=3 loops=1) -> Seq Scan on block (cost=0.00..20.00 rows=995 width=0) (actual time=0.00..0.00 rows=1 loops=2) Total runtime: 4.00 msec After vacuum analyze: Result (cost=7.30..9.39 rows=1 width=63) (actual time=3.00..3.00 rows=2 loops=1) InitPlan -> Seq Scan on block b (cost=0.00..1.01 rows=1 width=7) (actual time=0.00..0.00 rows=0 loops=1) -> Nested Loop (cost=7.30..9.39 rows=1 width=63) (actual time=3.00..3.00 rows=2 loops=1) -> Seq Scan on block (cost=0.00..1.01 rows=1 width=0) (actual time=0.00..0.00 rows=1 loops=1) -> Materialize (cost=8.37..8.37 rows=1 width=63) (actual time=3.00..3.00 rows=2 loops=1) -> Hash Join (cost=7.30..8.37 rows=1 width=63) (actual time=2.00..3.00 rows=2 loops=1) -> Seq Scan on account a (cost=0.00..1.04 rows=3 width=31) (actual time=0.00..0.00 rows=3 loops=1) -> Hash (cost=7.30..7.30 rows=1 width=32) (actual time=2.00..2.00 rows=0 loops=1) -> Index Scan using profile_sex_idx on profile p (cost=0.00..7.30 rows=1 width=32) (actual time= 2.00..2.00 rows=2 loops=1) SubPlan -> Materialize (cost=2.13..2.13 rows=1 width=41) (actual time=0.50..0.50 rows=1 loops=2) -> Nested Loop (cost=0.00..2.13 rows=1 width=41) (actual time=1.00..1.00 rows=3loops=1) -> Seq Scan on profile p1 (cost=0.00..1.04 rows=1 width=12) (actual time=0.00..0.00 rows=1 loops=1) -> Seq Scan on preference f (cost=0.00..1.03 rows=3 width=29) (actual time=0.00..0.00 rows=3 loops=1) Total runtime: 4.00 msec 12/5/2002 2:01:15 AM, Jochem van Dieten <jochemd@oli.tudelft.nl> wrote: >Vernon Wu wrote: >> >> SELECT p.userid, p.year, a.country, a.province, a.city >> FROM profile p, account a >> WHERE p.userid=a.userid AND >> (p.year BETWEEN 1961 AND 1976) AND >> a.country='CA' AND >> a.province='BC' AND >> p.gender='f' AND >> p.userid NOT IN (SELECT b.userid FROM block b WHERE b.personid='Joe') AND >> block.userid IS NOT NULL AND >> p.userid IN >> (SELECT f.userid FROM preference f, profile p1 WHERE p1.userid='Joe' AND 2002-p1.year BETWEEN >> f.minage AND f.maxage) > >You might want to flatten this into more joins and less subqueries, >especially since you are using IN which is not very optimized: > >SELECT p.userid, p.year, a.country, a.province, a.city >FROM profile p, account a, preference f, profile p1 >WHERE > f.userid = p.userid AND > p.userid=a.userid AND > (p.year BETWEEN 1961 AND 1976) AND > a.country='CA' AND > a.province='BC' AND > p.gender='f' AND > p.userid NOT IN (SELECT b.userid FROM block b WHERE b.personid='Joe') AND > block.userid IS NOT NULL AND > p1.userid='Joe' AND > 2002-p1.year BETWEEN f.minage AND f.maxage > >Also, I am not sure about the NOT IN. If you can rewrite it using EXISTS >try that, it might be faster. > > >> Nested Loop (cost=0.00..127.12 rows=995 width=894) >> -> Nested Loop (cost=0.00..97.17 rows=1 width=894) >> -> Seq Scan on account a (cost=0.00..25.00 rows=1 width=289) >> -> Index Scan using pk_profile on profile p (cost=0.00..72.16 rows=1 width=605) >> SubPlan >> -> Materialize (cost=22.50..22.50 rows=5 width=55) >> -> Seq Scan on block b (cost=0.00..22.50 rows=5 width=55 >> ) >> -> Materialize (cost=44.82..44.82 rows=111 width=89) >> -> Nested Loop (cost=0.00..44.82 rows=111 width=89) >> -> Index Scan using pk_profile on profile p1 (cost=0.00..4.82 rows=1 width=12) >> -> Seq Scan on preference f (cost=0.00..20.00 rows=1000 width=77) > >rows=1000 usually indicates you didn't vacuum analyze. Did you? > >> -> Seq Scan on block (cost=0.00..20.00 rows=995 width=0) > >And to add to Vernons questions: if you are using PostgreSQL 7.2 or >later, please send us the EXPLAIN ANALYZE output. > >Jochem > >
Ron, The gender is indexed. Each user has account and preference, but not necessary block. I am currently seeking for query optimisation, not system configuration optimisation 12/4/2002 9:26:48 PM, Ron Johnson <ron.l.johnson@cox.net> wrote: >On Wed, 2002-12-04 at 18:26, Vernon Wu wrote: >> I have the following query: >> >> SELECT p.userid, p.year, a.country, a.province, a.city >> FROM profile p, account a >> WHERE p.userid=a.userid AND >> (p.year BETWEEN 1961 AND 1976) AND >> a.country='CA' AND >> a.province='BC' AND >> p.gender='f' AND >> p.userid NOT IN (SELECT b.userid FROM block b WHERE b.personid='Joe') AND >> block.userid IS NOT NULL AND >> p.userid IN >> (SELECT f.userid FROM preference f, profile p1 WHERE p1.userid='Joe' AND 2002-p1.year BETWEEN >> f.minage AND f.maxage) >> >> In plain English, it is that >> >> Joe finds females between the ages in the location who is not in the block table, while Joe's age is between what they >> prefer. >> >> The query plan is the followings: >> >> Nested Loop (cost=0.00..127.12 rows=995 width=894) >> -> Nested Loop (cost=0.00..97.17 rows=1 width=894) >> -> Seq Scan on account a (cost=0.00..25.00 rows=1 width=289) >> -> Index Scan using pk_profile on profile p (cost=0.00..72.16 rows=1 width=605) >> SubPlan >> -> Materialize (cost=22.50..22.50 rows=5 width=55) >> -> Seq Scan on block b (cost=0.00..22.50 rows=5 width=55 >> ) >> -> Materialize (cost=44.82..44.82 rows=111 width=89) >> -> Nested Loop (cost=0.00..44.82 rows=111 width=89) >> -> Index Scan using pk_profile on profile p1 (cost=0.00..4.82 rows=1 width=12) >> -> Seq Scan on preference f (cost=0.00..20.00 rows=1000 width=77) >> -> Seq Scan on block (cost=0.00..20.00 rows=995 width=0) >> >> It seems take quite long to run this query. How to optimise the query? >> >> Thanks for your input. >> >> Vernon > >What kind of indexes, if any, do you have on, and what is the >cardinality of account, block and preference? > >What version of Postgres are you using? > >How much shared memory and buffers are you using? > >-- >+------------------------------------------------------------+ >| Ron Johnson, Jr. mailto:ron.l.johnson@cox.net | >| Jefferson, LA USA http://members.cox.net/ron.l.johnson | >| | >| "they love our milk and honey, but preach about another | >| way of living" | >| Merle Haggard, "The Fighting Side Of Me" | >+------------------------------------------------------------+ > > >---------------------------(end of broadcast)--------------------------- >TIP 4: Don't 'kill -9' the postmaster >
Vernon Wu wrote: > Ron, > > The gender is indexed. Each user has account and preference, but not necessary block. Indexing on gender won't speed up your query - it can even slow it down. You have probably 50% of "f" and 50% of "m". Using index on gender will divide your potential answers by 2. Make index on columns, which excludes much more useless rows. I think you can create index on: - block/personid - profile/userid I read in Postgres documentation(but didn't try) that you can also change "id NOT IN (select id" to "not exists select * where id=". It may help also. Do user have more than one account or preference? If no, you can change "not in" into "inner/outer join" which are the best ones. Regards, Tomasz Myrta
It is now common knowledge that the IN clause should be rewriten as an EXISTS. SELECT p.userid, p.year, a.country, a.province, a.city FROM profile p, account a WHERE p.userid=a.userid AND (p.year BETWEEN 1961 AND 1976) AND a.country='CA' AND a.province='BC' AND p.gender='f' AND NOT EXISTS ( SELECT 1 FROM block b WHERE b.personid='Joe' AND p.userid = b.userid) AND block.userid IS NOT NULL AND EXISTS ( SELECT 1 FROM preference f, profile p1 WHERE p1.userid='Joe' AND p.userid = f.userif AND 2002-p1.year BETWEEN f.minage AND f.maxage); Vernon Wu wrote: > > Ron, > > The gender is indexed. Each user has account and preference, but not necessary block. > > I am currently seeking for query optimisation, not system configuration optimisation > > 12/4/2002 9:26:48 PM, Ron Johnson <ron.l.johnson@cox.net> wrote: > > >On Wed, 2002-12-04 at 18:26, Vernon Wu wrote: > >> I have the following query: > >> > >> SELECT p.userid, p.year, a.country, a.province, a.city > >> FROM profile p, account a > >> WHERE p.userid=a.userid AND > >> (p.year BETWEEN 1961 AND 1976) AND > >> a.country='CA' AND > >> a.province='BC' AND > >> p.gender='f' AND > >> p.userid NOT IN (SELECT b.userid FROM block b WHERE b.personid='Joe') AND > >> block.userid IS NOT NULL AND > >> p.userid IN > >> (SELECT f.userid FROM preference f, profile p1 WHERE p1.userid='Joe' AND 2002-p1.year BETWEEN > >> f.minage AND f.maxage) > >> > >> In plain English, it is that > >> > >> Joe finds females between the ages in the location who is not in the block table, while Joe's age is between what > they > >> prefer. > >> > >> The query plan is the followings: > >> > >> Nested Loop (cost=0.00..127.12 rows=995 width=894) > >> -> Nested Loop (cost=0.00..97.17 rows=1 width=894) > >> -> Seq Scan on account a (cost=0.00..25.00 rows=1 width=289) > >> -> Index Scan using pk_profile on profile p (cost=0.00..72.16 rows=1 width=605) > >> SubPlan > >> -> Materialize (cost=22.50..22.50 rows=5 width=55) > >> -> Seq Scan on block b (cost=0.00..22.50 rows=5 width=55 > >> ) > >> -> Materialize (cost=44.82..44.82 rows=111 width=89) > >> -> Nested Loop (cost=0.00..44.82 rows=111 width=89) > >> -> Index Scan using pk_profile on profile p1 (cost=0.00..4.82 rows=1 width=12) > >> -> Seq Scan on preference f (cost=0.00..20.00 rows=1000 width=77) > >> -> Seq Scan on block (cost=0.00..20.00 rows=995 width=0) > >>
On Thu, Dec 05, 2002 at 11:08:17AM -0800, Vernon Wu wrote: > Ron, > > The gender is indexed. Given that gender only has two (? Very few, anyway) values, I can't believe an index will be much use: it's not very selective. Maybe combining several columns in one index will help you. A -- ---- Andrew Sullivan 204-4141 Yonge Street Liberty RMS Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
I just learnt the "common knowledge" about four hourse ago. That does help to improve the performance indeed according to the explain command. 12/5/2002 11:45:26 AM, Jean-Luc Lachance <jllachan@nsd.ca> wrote: >It is now common knowledge that the IN clause should be rewriten as an >EXISTS. > >SELECT p.userid, p.year, a.country, a.province, a.city >FROM profile p, account a >WHERE p.userid=a.userid AND > (p.year BETWEEN 1961 AND 1976) AND > a.country='CA' AND > a.province='BC' AND > p.gender='f' AND > NOT EXISTS ( SELECT 1 FROM block b WHERE b.personid='Joe' AND p.userid >= b.userid) AND > block.userid IS NOT NULL AND > EXISTS ( SELECT 1 FROM preference f, profile p1 > WHERE p1.userid='Joe' AND p.userid = f.userif AND > 2002-p1.year BETWEEN f.minage AND f.maxage); > > > >Vernon Wu wrote: >> >> Ron, >> >> The gender is indexed. Each user has account and preference, but not necessary block. >> >> I am currently seeking for query optimisation, not system configuration optimisation >> >> 12/4/2002 9:26:48 PM, Ron Johnson <ron.l.johnson@cox.net> wrote: >> >> >On Wed, 2002-12-04 at 18:26, Vernon Wu wrote: >> >> I have the following query: >> >> >> >> SELECT p.userid, p.year, a.country, a.province, a.city >> >> FROM profile p, account a >> >> WHERE p.userid=a.userid AND >> >> (p.year BETWEEN 1961 AND 1976) AND >> >> a.country='CA' AND >> >> a.province='BC' AND >> >> p.gender='f' AND >> >> p.userid NOT IN (SELECT b.userid FROM block b WHERE b.personid='Joe') AND >> >> block.userid IS NOT NULL AND >> >> p.userid IN >> >> (SELECT f.userid FROM preference f, profile p1 WHERE p1.userid='Joe' AND 2002-p1.year BETWEEN >> >> f.minage AND f.maxage) >> >> >> >> In plain English, it is that >> >> >> >> Joe finds females between the ages in the location who is not in the block table, while Joe's age is between what >> they >> >> prefer. >> >> >> >> The query plan is the followings: >> >> >> >> Nested Loop (cost=0.00..127.12 rows=995 width=894) >> >> -> Nested Loop (cost=0.00..97.17 rows=1 width=894) >> >> -> Seq Scan on account a (cost=0.00..25.00 rows=1 width=289) >> >> -> Index Scan using pk_profile on profile p (cost=0.00..72.16 rows=1 width=605) >> >> SubPlan >> >> -> Materialize (cost=22.50..22.50 rows=5 width=55) >> >> -> Seq Scan on block b (cost=0.00..22.50 rows=5 width=55 >> >> ) >> >> -> Materialize (cost=44.82..44.82 rows=111 width=89) >> >> -> Nested Loop (cost=0.00..44.82 rows=111 width=89) >> >> -> Index Scan using pk_profile on profile p1 (cost=0.00..4.82 rows=1 width=12) >> >> -> Seq Scan on preference f (cost=0.00..20.00 rows=1000 width=77) >> >> -> Seq Scan on block (cost=0.00..20.00 rows=995 width=0) >> >> > >---------------------------(end of broadcast)--------------------------- >TIP 5: Have you checked our extensive FAQ? > >http://www.postgresql.org/users-lounge/docs/faq.html >
Good for you! Too bad the parser does not know about it... Vernon Wu wrote: > > I just learnt the "common knowledge" about four hourse ago. That does help to improve the performance indeed > according to the explain command. > > 12/5/2002 11:45:26 AM, Jean-Luc Lachance <jllachan@nsd.ca> wrote: > > >It is now common knowledge that the IN clause should be rewriten as an > >EXISTS. > > > >SELECT p.userid, p.year, a.country, a.province, a.city > >FROM profile p, account a > >WHERE p.userid=a.userid AND > > (p.year BETWEEN 1961 AND 1976) AND > > a.country='CA' AND > > a.province='BC' AND > > p.gender='f' AND > > NOT EXISTS ( SELECT 1 FROM block b WHERE b.personid='Joe' AND p.userid > >= b.userid) AND > > block.userid IS NOT NULL AND > > EXISTS ( SELECT 1 FROM preference f, profile p1 > > WHERE p1.userid='Joe' AND p.userid = f.userif AND > > 2002-p1.year BETWEEN f.minage AND f.maxage); > > > > > > > >Vernon Wu wrote: > >> > >> Ron, > >> > >> The gender is indexed. Each user has account and preference, but not necessary block. > >> > >> I am currently seeking for query optimisation, not system configuration optimisation > >> > >> 12/4/2002 9:26:48 PM, Ron Johnson <ron.l.johnson@cox.net> wrote: > >> > >> >On Wed, 2002-12-04 at 18:26, Vernon Wu wrote: > >> >> I have the following query: > >> >> > >> >> SELECT p.userid, p.year, a.country, a.province, a.city > >> >> FROM profile p, account a > >> >> WHERE p.userid=a.userid AND > >> >> (p.year BETWEEN 1961 AND 1976) AND > >> >> a.country='CA' AND > >> >> a.province='BC' AND > >> >> p.gender='f' AND > >> >> p.userid NOT IN (SELECT b.userid FROM block b WHERE b.personid='Joe') AND > >> >> block.userid IS NOT NULL AND > >> >> p.userid IN > >> >> (SELECT f.userid FROM preference f, profile p1 WHERE p1.userid='Joe' AND 2002-p1.year BETWEEN > >> >> f.minage AND f.maxage) > >> >> > >> >> In plain English, it is that > >> >> > >> >> Joe finds females between the ages in the location who is not in the block table, while Joe's age is between what > >> they > >> >> prefer. > >> >> > >> >> The query plan is the followings: > >> >> > >> >> Nested Loop (cost=0.00..127.12 rows=995 width=894) > >> >> -> Nested Loop (cost=0.00..97.17 rows=1 width=894) > >> >> -> Seq Scan on account a (cost=0.00..25.00 rows=1 width=289) > >> >> -> Index Scan using pk_profile on profile p (cost=0.00..72.16 rows=1 width=605) > >> >> SubPlan > >> >> -> Materialize (cost=22.50..22.50 rows=5 width=55) > >> >> -> Seq Scan on block b (cost=0.00..22.50 rows=5 width=55 > >> >> ) > >> >> -> Materialize (cost=44.82..44.82 rows=111 width=89) > >> >> -> Nested Loop (cost=0.00..44.82 rows=111 width=89) > >> >> -> Index Scan using pk_profile on profile p1 (cost=0.00..4.82 rows=1 width=12) > >> >> -> Seq Scan on preference f (cost=0.00..20.00 rows=1000 width=77) > >> >> -> Seq Scan on block (cost=0.00..20.00 rows=995 width=0) > >> >> > > > >---------------------------(end of broadcast)--------------------------- > >TIP 5: Have you checked our extensive FAQ? > > > >http://www.postgresql.org/users-lounge/docs/faq.html > >
Vernon Wu wrote: > > The followings are the analyise outcomes after I did some modifications with the query. My finding is that the new query > does improve the performance according to the plan. The actual time is reversed might due to the fact the test data is > very small (the machine is a very old one by the way). The userid is the key for all tables and the gender is indexed.Do I > also index the country and province to improve the preformance? You start by using a dataset of realistic size. Sorry, but if the actual time is < 10.00 ms it is rather pointless to optimize further since chance is going to be the biggest factor. And the IN/EXISTS difference is dependent on dataset size. Jochem
12/5/2002 1:06:03 PM, Jochem van Dieten <jochemd@oli.tudelft.nl> wrote: >You start by using a dataset of realistic size. Sorry, but if the actual >time is < 10.00 ms it is rather pointless to optimize further since >chance is going to be the biggest factor. And the IN/EXISTS difference >is dependent on dataset size. > Do you mean that using "EXIST" is not necessary out-perform using 'IN" even the "explain" say so? What is the right size for those two key words? Thanks for your very hepful information. Vernon
Vernon Wu wrote: > 12/5/2002 1:06:03 PM, Jochem van Dieten <jochemd@oli.tudelft.nl> wrote: > >>You start by using a dataset of realistic size. Sorry, but if the actual >>time is < 10.00 ms it is rather pointless to optimize further since >>chance is going to be the biggest factor. And the IN/EXISTS difference >>is dependent on dataset size. > > Do you mean that using "EXIST" is not necessary out-perform using 'IN" even the "explain" say so? What is the right > size for those two key words? IIRC, IN might be faster on small datasets, but EXISTS is faster on big ones. So you have to optimize with a dataset that resembles the actual dataset you will be using in production as close as possible. I don't know what the size is at which one gets faster as the other. Jochem
Andrew, Following your suggestion, I have combined the year field with the gender to create a multicolumn index. That shall be better than indexing gender alone. I also create a multicolumn index (country, province, city) for the account table. Would you suggest indexing all possible fields such as ethnicity, religion , education, employment in the profile table;or based on what queries I run, to have some multicolumn indexes? BTW, do you get a lot of snow in Toronto these few days? Veronon 12/5/2002 11:57:50 AM, Andrew Sullivan <andrew@libertyrms.info> wrote: >On Thu, Dec 05, 2002 at 11:08:17AM -0800, Vernon Wu wrote: >> Ron, >> >> The gender is indexed. > >Given that gender only has two (? Very few, anyway) values, I can't >believe an index will be much use: it's not very selective. Maybe >combining several columns in one index will help you. > >A > >-- >---- >Andrew Sullivan 204-4141 Yonge Street >Liberty RMS Toronto, Ontario Canada ><andrew@libertyrms.info> M2P 2A8 > +1 416 646 3304 x110 > > >---------------------------(end of broadcast)--------------------------- >TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) >
On Thu, Dec 05, 2002 at 04:04:28PM -0500, Jean-Luc Lachance wrote: > Good for you! Too bad the parser does not know about it... LAst I heard, there was a problem about providing a rigourous mathematical proof that NOT EXISTS and NOT IN are really the same. If you can prove it, I'm sure people would be pleased. A -- ---- Andrew Sullivan 204-4141 Yonge Street Liberty RMS Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
On Thu, Dec 05, 2002 at 03:19:29PM -0800, Vernon Wu wrote: > Andrew, > > Would you suggest indexing all possible fields such as ethnicity, > religion , education, employment in the profile table; or based on > what queries I run, to have some multicolumn indexes? Never index anything more than you need. There is a fairly serious penalty at insertion time for indexes, so you can reduce some overhead that way. Note, too, that index space is not recaptured by Postgres's VACUUM, which imposes a small performance cost, but can be a real disk-gobbler if you're not careful. > BTW, do you get a lot of snow in Toronto these few days? We had some a few weeks ago. It's pretty clear right now. A -- ---- Andrew Sullivan 204-4141 Yonge Street Liberty RMS Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
On Thu, 5 Dec 2002, Vernon Wu wrote: > Andrew, > > Following your suggestion, I have combined the year field with the gender to create a multicolumn index. That shall be > better than indexing gender alone. I also create a multicolumn index (country, province, city) for the account table. > > Would you suggest indexing all possible fields such as ethnicity, religion , education, employment in the profile table;or > based on what queries I run, to have some multicolumn indexes? > > BTW, do you get a lot of snow in Toronto these few days? Vernon, just so you know, for multi-column indexes to be useful in Postgresql, the columns need to be used in the same order they are declared in the index if you are using them for an order by. select * from table order by sex, age; could use the index create column table_sex_age on table (sex,age); but would not use the index create column table_age_sex on table (age,sex); However, the order in a where clause portion doesn't really seem to matter, so select * from table where sex='m' and age>=38 and select * from table where age>=38 and sex='m' should both be able to use the index. also, you can use functional indexes, but the arguments in the where clause need the same basic form to be useful. So, if you commonly make a select like this: select * from table where age>50 and age<=59; then you could make a functional index like : create index table_age_50_59 on table (age) where age>50 and age<=59; However, the query select * from table where age>50 and age<=58; Wouldn't use that index, since the age <= part doesn't match up. It could possible use a generic index on age though, i.e. one like create index table_age on table (age); But that index will be larger than the partial one, and so the planner may skip using it and use a seq scan instead. Hard to say until your database is populated with some representational test data. Since these indexes will be only a small fraction of the total data, it will often be advantageous to use them with a query. After you have a set of test data, then you can start looking at tuning random page cost and such to make your hardware perform properly for individual queries. Well, hope that helps.
12/5/2002 4:18:10 PM, "scott.marlowe" <scott.marlowe@ihs.com> wrote: >Vernon, just so you know, for multi-column indexes to be useful in >Postgresql, the columns need to be used in the same order they are >declared in the index if you are using them for an order by. > >select * from table order by sex, age; > >could use the index > >create column table_sex_age on table (sex,age); > >but would not use the index > >create column table_age_sex on table (age,sex); > I haven't have this case yet, might apply for some queries soon. >However, the order in a where clause portion doesn't really seem to >matter, so > >select * from table where sex='m' and age>=38 > >and > >select * from table where age>=38 and sex='m' > >should both be able to use the index. > >also, you can use functional indexes, but the arguments in the where >clause need the same basic form to be useful. So, if you commonly make a >select like this: > >select * from table where age>50 and age<=59; > >then you could make a functional index like : > >create index table_age_50_59 on table (age) where age>50 and age<=59; > >However, the query > >select * from table where age>50 and age<=58; > >Wouldn't use that index, since the age <= part doesn't match up. It could >possible use a generic index on age though, i.e. one like > >create index table_age on table (age); > I didn't know the functional index. Thanks for the eductional information. >But that index will be larger than the partial one, and so the planner may >skip using it and use a seq scan instead. Hard to say until your database >is populated with some representational test data. > >Since these indexes will be only a small fraction of the total data, it >will often be advantageous to use them with a query. > >After you have a set of test data, then you can start looking at tuning >random page cost and such to make your hardware perform properly for >individual queries. Well, hope that helps. > > I will do some fine query tuning in the final test phase. Right now, I want to make sure the table design and queries are on the right track. That indeed helps. Thanks, Vernon