Thread: JOIN not being calculated correctly

JOIN not being calculated correctly

From
"Scott Pederick"
Date:
Hi all!

I'm having a problem with Postgresql 8.0.0-beta4 on a Win2K machine and a
particular join.

I've got two tables - a list of customers and jobs they've had. A customer
can have multiple jobs.

The query always scans the entire jobs table for each customer - I need it
the other way around so I can get a list of the customers who have at least
one job.

The EXPLAIN shows the jobs table is being scanned for some reason:

Quarry=# EXPLAIN SELECT Customers.CustomerId, Jobs.JobId FROM Customers
INNER JOIN Jobs USING (CustomerId);                              QUERY PLAN
-------------------------------------------------------------------------Hash Join  (cost=78.54..4908.71 rows=70727
width=8) Hash Cond: ("outer".customerid = "inner".customerid)  ->  Seq Scan on jobs  (cost=0.00..3769.27 rows=70727
width=8) ->  Hash  (cost=76.03..76.03 rows=1003 width=4)        ->  Seq Scan on customers  (cost=0.00..76.03 rows=1003
width=4)
(5 rows)



Even if I reverse the JOIN I get the exact same result:

Quarry=# EXPLAIN SELECT Customers.CustomerId, Jobs.JobId FROM Jobs INNER
JOIN Customers USING (CustomerId);                              QUERY PLAN
-------------------------------------------------------------------------Hash Join  (cost=78.54..4908.71 rows=70727
width=8) Hash Cond: ("outer".customerid = "inner".customerid)  ->  Seq Scan on jobs  (cost=0.00..3769.27 rows=70727
width=8) ->  Hash  (cost=76.03..76.03 rows=1003 width=4)        ->  Seq Scan on customers  (cost=0.00..76.03 rows=1003
width=4)
(5 rows)


How can I force it to operate as I need it to? It seems the query engine is
a little smarter than it needs to be.

If anyone can shed some light on this problem, it would be greatly
appreciated. I've taken it as far as I can and don't really know where to
move from here.

Thanks in advance,

Scott Pederick



Re: JOIN not being calculated correctly

From
Andrew Hammond
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Posting the EXPLAIN is a good, an EXPLAIN ANALYZE would be better
(assuming your dataset is small enough for it to complete in this
lifetime). You also need to include the following information:

1) The schema involved, including information about indexes being used.

2) Have you vacuumed / analyzed the tables involved recently?

3) Have you modified the stats on any of the tables / columns involve or
are you using defaults?

Drew


Scott Pederick wrote:
| Hi all!
|
| I'm having a problem with Postgresql 8.0.0-beta4 on a Win2K machine and a
| particular join.
|
| I've got two tables - a list of customers and jobs they've had. A customer
| can have multiple jobs.
|
| The query always scans the entire jobs table for each customer - I need it
| the other way around so I can get a list of the customers who have at
least
| one job.
|
| The EXPLAIN shows the jobs table is being scanned for some reason:
|
| Quarry=# EXPLAIN SELECT Customers.CustomerId, Jobs.JobId FROM Customers
| INNER JOIN Jobs USING (CustomerId);
|                                QUERY PLAN
| -------------------------------------------------------------------------
|  Hash Join  (cost=78.54..4908.71 rows=70727 width=8)
|    Hash Cond: ("outer".customerid = "inner".customerid)
|    ->  Seq Scan on jobs  (cost=0.00..3769.27 rows=70727 width=8)
|    ->  Hash  (cost=76.03..76.03 rows=1003 width=4)
|          ->  Seq Scan on customers  (cost=0.00..76.03 rows=1003 width=4)
| (5 rows)
|
|
|
| Even if I reverse the JOIN I get the exact same result:
|
| Quarry=# EXPLAIN SELECT Customers.CustomerId, Jobs.JobId FROM Jobs INNER
| JOIN Customers USING (CustomerId);
|                                QUERY PLAN
| -------------------------------------------------------------------------
|  Hash Join  (cost=78.54..4908.71 rows=70727 width=8)
|    Hash Cond: ("outer".customerid = "inner".customerid)
|    ->  Seq Scan on jobs  (cost=0.00..3769.27 rows=70727 width=8)
|    ->  Hash  (cost=76.03..76.03 rows=1003 width=4)
|          ->  Seq Scan on customers  (cost=0.00..76.03 rows=1003 width=4)
| (5 rows)
|
|
| How can I force it to operate as I need it to? It seems the query
engine is
| a little smarter than it needs to be.
|
| If anyone can shed some light on this problem, it would be greatly
| appreciated. I've taken it as far as I can and don't really know where to
| move from here.


- --
Andrew Hammond    416-673-4138    ahammond@ca.afilias.info
Database Administrator, Afilias Canada Corp.
CB83 2838 4B67 D40F D086 3568 81FC E7E5 27AF 4A9A
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)

iD8DBQFBh9YQgfzn5SevSpoRAg0LAKCg5K7IccFIOvdTc8DEl2YaUMcUCgCfRt2Q
CI1Vo6yxHkrWcoTQMQ/EvOw=
=m15B
-----END PGP SIGNATURE-----


Re: JOIN not being calculated correctly

From
"Scott Pederick"
Date:
Hi Andrew,

Thanks for your response - sorry it's taken so long to reply, I've been out
of contact for a while.

Still can't fathom why it's not using an index scan on the jobs table..

: Posting the EXPLAIN is a good, an EXPLAIN ANALYZE would be better
: (assuming your dataset is small enough for it to complete in this
: lifetime). You also need to include the following information:
:
: 1) The schema involved, including information about indexes being used.

Here's the EXPLAIN ANALYZE, schemas and indexes:

Quarry=> EXPLAIN ANALYZE SELECT Customers.CustomerId, Jobs.JobId FROM
Customers INNER JOIN Jobs USING (CustomerId);                                                     QUERY PLAN
----------------------------------------------------------------------------
------------------------------------------Hash Join  (cost=78.54..7924.71 rows=70727 width=8) (actual
time=50.000..21941.000 rows=70727 loops=1)  Hash Cond: ("outer".customerid = "inner".customerid)  ->  Seq Scan on jobs
(cost=0.00..6785.27rows=70727 width=8) (actual
 
time=40.000..21040.000 rows=70727 loops=1)  ->  Hash  (cost=76.03..76.03 rows=1003 width=4) (actual
time=10.000..10.000 rows=0 loops=1)        ->  Seq Scan on customers  (cost=0.00..76.03 rows=1003 width=4)
(actual time=0.000..0.000 rows=1003 loops=1)Total runtime: 22292.000 ms
(6 rows)


Customers
---
CREATE TABLE customers (   customerid serial NOT NULL,   account character varying(6) DEFAULT ''::character varying NOT
NULL,  businessname character varying(40) DEFAULT ''::character varying NOT
 
NULL,   address character varying(30) DEFAULT ''::character varying NOT NULL,   suburb character varying(30) DEFAULT
''::charactervarying NOT NULL,   postcode character varying(4) DEFAULT ''::character varying NOT NULL,   state
character(3)DEFAULT ''::bpchar NOT NULL,   mobile character varying(14) DEFAULT ''::character varying NOT NULL,   fax
charactervarying(14) DEFAULT ''::character varying NOT NULL,   acctitle character varying(4) DEFAULT ''::character
varyingNOT NULL,   accfirst character varying(14) DEFAULT ''::character varying NOT NULL,   acclast character
varying(14)DEFAULT ''::character varying NOT NULL,   accphone character varying(14) DEFAULT ''::character varying NOT
NULL,  accfax character(1) DEFAULT ''::bpchar NOT NULL,   accmobile character varying(14) DEFAULT ''::character varying
NOTNULL,   alttitle character varying(4) DEFAULT ''::character varying NOT NULL,   altfirst character varying(14)
DEFAULT''::character varying NOT NULL,   altlast character varying(14) DEFAULT ''::character varying NOT NULL,
currentnumeric(10,2) DEFAULT 0 NOT NULL,   aged30 numeric(10,2) DEFAULT 0 NOT NULL,   aged60 numeric(10,2) DEFAULT 0
NOTNULL,   aged90 numeric(10,2) DEFAULT 0 NOT NULL,   ytd numeric(10,2) DEFAULT 0 NOT NULL,   pg character(1) DEFAULT
''::bpcharNOT NULL,   acclimit character varying(14) DEFAULT ''::character varying NOT NULL,   accauto character(1)
DEFAULT''::bpchar NOT NULL,   text text DEFAULT ''::text NOT NULL,   fee character(1) DEFAULT ''::bpchar NOT NULL,
gstcharacter(1) DEFAULT ''::bpchar NOT NULL,   stop character(1) DEFAULT ''::bpchar NOT NULL
 
);

Customers - Indexes
---
customers_pkey PRIMARY KEY (customerid);
UNIQUE INDEX customers_account ON customers USING btree (account);



Jobs
---
CREATE TABLE jobs (   jobid serial NOT NULL,   customerid integer,   quarryid integer,   producttypeid integer,
invoiceidinteger,   salesid integer,   orderid character varying(15) DEFAULT ''::character varying NOT NULL,   jobdate
date,  contactname character varying(40) DEFAULT ''::character varying NOT
 
NULL,   businessname character varying(40) DEFAULT ''::character varying NOT
NULL,   address character varying(30) DEFAULT ''::character varying NOT NULL,   suburb character varying(30) DEFAULT
''::charactervarying NOT NULL,   phone character varying(14) DEFAULT ''::character varying NOT NULL,   melway character
varying(4)DEFAULT ''::character varying NOT NULL,   melxy character(3) DEFAULT ''::bpchar NOT NULL,   quarryorder
numeric(10,2)DEFAULT 0 NOT NULL,   quarrypickup numeric(10,2) DEFAULT 0 NOT NULL,   quarryprice numeric(10,2) DEFAULT 0
NOTNULL,   transport numeric(10,2) DEFAULT 0 NOT NULL,   margin numeric(10,2) DEFAULT 0 NOT NULL,   unit character(1)
DEFAULT''::bpchar NOT NULL,   done character(1) DEFAULT ''::bpchar NOT NULL,   xno character varying(10) DEFAULT
''::charactervarying NOT NULL,   mzone character(1) DEFAULT ''::bpchar NOT NULL,   invoice numeric(10,2) DEFAULT 0 NOT
NULL,  drivercomment character varying(12) DEFAULT ''::character varying NOT
 
NULL   text text DEFAULT ''::text NOT NULL,   gst numeric(10,2) DEFAULT 0 NOT NULL,   distance numeric(7,2) DEFAULT 0
NOTNULL,   productdefinitionid integer,   customerprice numeric(10,2) DEFAULT 0 NOT NULL
 
);

Jobs Keys/Indexes
---
jobs_pkey PRIMARY KEY (jobid);
INDEX jobs_customerid ON jobs USING btree (customerid);
INDEX jobs_jobdate ON jobs USING btree (jobdate);
INDEX jobs_quarryid ON jobs USING btree (quarryid);

CONSTRAINT jobs_customerid_fkey FOREIGN KEY (customerid) REFERENCES
customers(customerid) ON UPDATE RESTRICT ON DELETE RESTRICT;
CONSTRAINT jobs_invoiceid_fkey FOREIGN KEY (invoiceid) REFERENCES
invoices(invoiceid) ON UPDATE RESTRICT ON DELETE RESTRICT;
ADD CONSTRAINT jobs_productdefinitionid_fkey FOREIGN KEY
(productdefinitionid) REFERENCES producttypes(producttypeid) ON UPDATE
RESTRICT;
CONSTRAINT jobs_producttypeid_fkey FOREIGN KEY (producttypeid) REFERENCES
producttypes(producttypeid) ON UPDATE RESTRICT ON DELETE RESTRICT;
CONSTRAINT jobs_quarryid_fkey FOREIGN KEY (quarryid) REFERENCES
quarries(quarryid) ON UPDATE RESTRICT ON DELETE RESTRICT;
CONSTRAINT jobs_salesid_fkey FOREIGN KEY (salesid) REFERENCES sales(salesid)
ON UPDATE RESTRICT ON DELETE RESTRICT;



: 2) Have you vacuumed / analyzed the tables involved recently?

Yes, I ran "VACUUM ANALYZE" on both table before I started.


: 3) Have you modified the stats on any of the tables / columns involve or
: are you using defaults?

I'm using the defaults...

Thanks in advance to anyone who has taken the time to wade through this
one... :)

Scott

:
: Drew
:
:
: Scott Pederick wrote:
: | Hi all!
: |
: | I'm having a problem with Postgresql 8.0.0-beta4 on a Win2K machine and
a
: | particular join.
: |
: | I've got two tables - a list of customers and jobs they've had. A
customer
: | can have multiple jobs.
: |
: | The query always scans the entire jobs table for each customer - I need
it
: | the other way around so I can get a list of the customers who have at
: least
: | one job.
: |
: | The EXPLAIN shows the jobs table is being scanned for some reason:
: |
: | Quarry=# EXPLAIN SELECT Customers.CustomerId, Jobs.JobId FROM Customers
: | INNER JOIN Jobs USING (CustomerId);
: |                                QUERY PLAN
:
| -------------------------------------------------------------------------
: |  Hash Join  (cost=78.54..4908.71 rows=70727 width=8)
: |    Hash Cond: ("outer".customerid = "inner".customerid)
: |    ->  Seq Scan on jobs  (cost=0.00..3769.27 rows=70727 width=8)
: |    ->  Hash  (cost=76.03..76.03 rows=1003 width=4)
: |          ->  Seq Scan on customers  (cost=0.00..76.03 rows=1003 width=4)
: | (5 rows)
: |
: |
: |
: | Even if I reverse the JOIN I get the exact same result:
: |
: | Quarry=# EXPLAIN SELECT Customers.CustomerId, Jobs.JobId FROM Jobs INNER
: | JOIN Customers USING (CustomerId);
: |                                QUERY PLAN
:
| -------------------------------------------------------------------------
: |  Hash Join  (cost=78.54..4908.71 rows=70727 width=8)
: |    Hash Cond: ("outer".customerid = "inner".customerid)
: |    ->  Seq Scan on jobs  (cost=0.00..3769.27 rows=70727 width=8)
: |    ->  Hash  (cost=76.03..76.03 rows=1003 width=4)
: |          ->  Seq Scan on customers  (cost=0.00..76.03 rows=1003 width=4)
: | (5 rows)
: |
: |
: | How can I force it to operate as I need it to? It seems the query
: engine is
: | a little smarter than it needs to be.
: |
: | If anyone can shed some light on this problem, it would be greatly
: | appreciated. I've taken it as far as I can and don't really know where
to
: | move from here.
:
:
: - --
: Andrew Hammond    416-673-4138    ahammond@ca.afilias.info
: Database Administrator, Afilias Canada Corp.
: CB83 2838 4B67 D40F D086 3568 81FC E7E5 27AF 4A9A
: -----BEGIN PGP SIGNATURE-----
: Version: GnuPG v1.2.5 (GNU/Linux)
:
: iD8DBQFBh9YQgfzn5SevSpoRAg0LAKCg5K7IccFIOvdTc8DEl2YaUMcUCgCfRt2Q
: CI1Vo6yxHkrWcoTQMQ/EvOw=
: =m15B
: -----END PGP SIGNATURE-----
:
: ---------------------------(end of broadcast)---------------------------
: TIP 3: if posting/reading through Usenet, please send an appropriate
:       subscribe-nomail command to majordomo@postgresql.org so that your
:       message can get through to the mailing list cleanly
:



Re: JOIN not being calculated correctly

From
Tom Lane
Date:
"Scott Pederick" <sql.postgresql.org@pederick.com> writes:
> Still can't fathom why it's not using an index scan on the jobs table..

Why exactly do you think that would make it faster?

The query evidently requires visiting every single jobs row, so a
seqscan seems appropriate to me; indeed I'd say the planner picked
exactly the perfect plan.  If you think not, try forcing other plan
choices and see what happens to the runtime.
        regards, tom lane