Thread: joining two tables slow due to sequential scan
= clinicaldocuments.dssdocumentidentifier where documentversions.documentstatus = 'AC';
= documentversions .documentidentifier where clinicaldocuments.patientidentifier= 123;
What version of postgres are you using? Can you post the output from EXPLAIN ANALYZE?
-----Original Message-----
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Tim Jones
Sent: Friday, February 10, 2006 4:07 PM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] joining two tables slow due to sequential scan
I am trying to join two tables and keep getting a sequential scan in the plan even though there is an index on the columns I am joining on. Basically this the deal ... I have two tables with docid in them which is what I am using for the join.
ClinicalDocs ... (no primary key) though it does not help if I make docid primary key
docid integer (index)
patientid integer (index)
visitid integer (index)
...
Documentversions
docid integer (index)
docversionnumber (index)
docversionidentifier (primary key)
It seems to do an index scan if I put the primary key as docid. This is what occurs when I link on the patid from ClinicalDocs to patient table. However I can not make the docid primary key because it gets repeated depending on how may versions of a document I have. I have tried using a foreign key on documentversions with no sucess.
In addition this query
select * from documentversions join clinicaldocuments on documentversions.documentidentifier
= clinicaldocuments.dssdocumentidentifier where documentversions.documentstatus = 'AC';
does index scan
but if I change the order e.g
select * from clinicaldocuments join documentversions on clinicaldocuments.dssdocumentidentifier
= documentversions .documentidentifier where clinicaldocuments.patientidentifier= 123;
does sequential scan what I need is bottom query
it is extremely slow ... Any ideas ?
Tim Jones
Healthcare Project Manager
Optio Software, Inc.
(770) 576-3555
On Fri, 2006-02-10 at 16:06, Tim Jones wrote: > > I am trying to join two tables and keep getting a sequential scan in > the plan even though there is an index on the columns I am joining > on. Basically this the deal ... I have two tables with docid in them > which is what I am using for the join. > SNIP > select * from documentversions join clinicaldocuments on > documentversions.documentidentifier > = clinicaldocuments.dssdocumentidentifier where > documentversions.documentstatus = 'AC'; > > does index scan > but if I change the order e.g > > select * from clinicaldocuments join documentversions on > clinicaldocuments.dssdocumentidentifier > = documentversions .documentidentifier where > clinicaldocuments.patientidentifier= 123; OK. I'm gonna make a couple of guesses here: 1: clinicaldocuments.patientidentifier is an int8 and you're running 7.4 or before. 2: There are more rows with clinicaldocuments.patientidentifier= 123 than with documentversions.documentstatus = 'AC'. 3: documentversions.documentidentifier and clinicaldocuments.dssdocumentidentifier are not the same type. Any of those things true?
OK. I'm gonna make a couple of guesses here: 1: clinicaldocuments.patientidentifier is an int8 and you're running 7.4 or before. -- nope int4 and 8.1 2: There are more rows with clinicaldocuments.patientidentifier= 123 than with documentversions.documentstatus = 'AC'. -- nope generally speaking all statuses are 'AC' 3: documentversions.documentidentifier and clinicaldocuments.dssdocumentidentifier are not the same type. -- nope both int4 Any of those things true?
On Fri, 2006-02-10 at 16:35, Tim Jones wrote: > OK. I'm gonna make a couple of guesses here: > > 1: clinicaldocuments.patientidentifier is an int8 and you're running > 7.4 or before. > > -- nope int4 and 8.1 > > 2: There are more rows with clinicaldocuments.patientidentifier= 123 > than with documentversions.documentstatus = 'AC'. > > -- nope generally speaking all statuses are 'AC' > > 3: documentversions.documentidentifier and > clinicaldocuments.dssdocumentidentifier are not the same type. > > -- nope both int4 OK then, I guess we'll need to see the explain analyze output of both of those queries.
for first query QUERY PLAN 'Limit (cost=4.69..88.47 rows=10 width=1350) (actual time=32.195..32.338 rows=10 loops=1)' ' -> Nested Loop (cost=4.69..4043.09 rows=482 width=1350) (actual time=32.190..32.316 rows=10 loops=1)' ' -> Bitmap Heap Scan on documentversions (cost=4.69..1139.40 rows=482 width=996) (actual time=32.161..32.171 rows=10 loops=1)' ' Recheck Cond: (documentstatus = ''AC''::bpchar)' ' -> Bitmap Index Scan on ix_docstatus (cost=0.00..4.69 rows=482 width=0) (actual time=31.467..31.467 rows=96368 loops=1)' ' Index Cond: (documentstatus = ''AC''::bpchar)' ' -> Index Scan using ix_cdocdid on clinicaldocuments (cost=0.00..6.01 rows=1 width=354) (actual time=0.006..0.007 rows=1 loops=10)' ' Index Cond: ("outer".documentidentifier = clinicaldocuments.dssdocumentidentifier)' for second query QUERY PLAN 'Hash Join (cost=899.83..4384.17 rows=482 width=1350)' ' Hash Cond: ("outer".documentidentifier = "inner".dssdocumentidentifier)' ' -> Seq Scan on documentversions (cost=0.00..2997.68 rows=96368 width=996)' ' -> Hash (cost=898.62..898.62 rows=482 width=354)' ' -> Bitmap Heap Scan on clinicaldocuments (cost=4.69..898.62 rows=482 width=354)' ' Recheck Cond: (patientidentifier = 123)' ' -> Bitmap Index Scan on ix_cdocpid (cost=0.00..4.69 rows=482 width=0)' ' Index Cond: (patientidentifier = 123)' thnx Tim Jones Healthcare Project Manager Optio Software, Inc. (770) 576-3555 ________________________________ From: Dave Dutcher [mailto:dave@tridecap.com] Sent: Friday, February 10, 2006 5:15 PM To: Tim Jones; pgsql-performance@postgresql.org Subject: RE: [PERFORM] joining two tables slow due to sequential scan What version of postgres are you using? Can you post the output from EXPLAIN ANALYZE? -----Original Message----- From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Tim Jones Sent: Friday, February 10, 2006 4:07 PM To: pgsql-performance@postgresql.org Subject: [PERFORM] joining two tables slow due to sequential scan I am trying to join two tables and keep getting a sequential scan in the plan even though there is an index on the columns I am joining on. Basically this the deal ... I have two tables with docid in them which is what I am using for the join. ClinicalDocs ... (no primary key) though it does not help if I make docid primary key docid integer (index) patientid integer (index) visitid integer (index) ... Documentversions docid integer (index) docversionnumber (index) docversionidentifier (primary key) It seems to do an index scan if I put the primary key as docid. This is what occurs when I link on the patid from ClinicalDocs to patient table. However I can not make the docid primary key because it gets repeated depending on how may versions of a document I have. I have tried using a foreign key on documentversions with no sucess. In addition this query select * from documentversions join clinicaldocuments on documentversions.documentidentifier = clinicaldocuments.dssdocumentidentifier where documentversions.documentstatus = 'AC'; does index scan but if I change the order e.g select * from clinicaldocuments join documentversions on clinicaldocuments.dssdocumentidentifier = documentversions .documentidentifier where clinicaldocuments.patientidentifier= 123; does sequential scan what I need is bottom query it is extremely slow ... Any ideas ? Tim Jones Healthcare Project Manager Optio Software, Inc. (770) 576-3555
On Fri, 2006-02-10 at 16:37, Tim Jones wrote: > for first query > > QUERY PLAN > 'Limit (cost=4.69..88.47 rows=10 width=1350) (actual > time=32.195..32.338 rows=10 loops=1)' > ' -> Nested Loop (cost=4.69..4043.09 rows=482 width=1350) (actual > time=32.190..32.316 rows=10 loops=1)' > ' -> Bitmap Heap Scan on documentversions (cost=4.69..1139.40 > rows=482 width=996) (actual time=32.161..32.171 rows=10 loops=1)' > ' Recheck Cond: (documentstatus = ''AC''::bpchar)' > ' -> Bitmap Index Scan on ix_docstatus (cost=0.00..4.69 > rows=482 width=0) (actual time=31.467..31.467 rows=96368 loops=1)' > ' Index Cond: (documentstatus = ''AC''::bpchar)' > ' -> Index Scan using ix_cdocdid on clinicaldocuments > (cost=0.00..6.01 rows=1 width=354) (actual time=0.006..0.007 rows=1 > loops=10)' > ' Index Cond: ("outer".documentidentifier = > clinicaldocuments.dssdocumentidentifier)' > > > for second query > > QUERY PLAN > 'Hash Join (cost=899.83..4384.17 rows=482 width=1350)' > ' Hash Cond: ("outer".documentidentifier = > "inner".dssdocumentidentifier)' > ' -> Seq Scan on documentversions (cost=0.00..2997.68 rows=96368 > width=996)' > ' -> Hash (cost=898.62..898.62 rows=482 width=354)' > ' -> Bitmap Heap Scan on clinicaldocuments (cost=4.69..898.62 > rows=482 width=354)' > ' Recheck Cond: (patientidentifier = 123)' > ' -> Bitmap Index Scan on ix_cdocpid (cost=0.00..4.69 > rows=482 width=0)' > ' Index Cond: (patientidentifier = 123)' OK, the first one is explain analyze, but the second one is just plain explain...
oops QUERY PLAN 'Hash Join (cost=899.83..4384.17 rows=482 width=1350) (actual time=0.203..0.203 rows=0 loops=1)' ' Hash Cond: ("outer".documentidentifier = "inner".dssdocumentidentifier)' ' -> Seq Scan on documentversions (cost=0.00..2997.68 rows=96368 width=996) (actual time=0.007..0.007 rows=1 loops=1)' ' -> Hash (cost=898.62..898.62 rows=482 width=354) (actual time=0.161..0.161 rows=0 loops=1)' ' -> Bitmap Heap Scan on clinicaldocuments (cost=4.69..898.62 rows=482 width=354) (actual time=0.159..0.159 rows=0 loops=1)' ' Recheck Cond: (patientidentifier = 123)' ' -> Bitmap Index Scan on ix_cdocpid (cost=0.00..4.69 rows=482 width=0) (actual time=0.153..0.153 rows=0 loops=1)' ' Index Cond: (patientidentifier = 123)' 'Total runtime: 0.392 ms' note I have done these on a smaller db than what I am using but the plans are the same Tim Jones Healthcare Project Manager Optio Software, Inc. (770) 576-3555 -----Original Message----- From: Scott Marlowe [mailto:smarlowe@g2switchworks.com] Sent: Friday, February 10, 2006 5:39 PM To: Tim Jones Cc: Dave Dutcher; pgsql-performance@postgresql.org Subject: Re: [PERFORM] joining two tables slow due to sequential scan On Fri, 2006-02-10 at 16:37, Tim Jones wrote: > for first query > > QUERY PLAN > 'Limit (cost=4.69..88.47 rows=10 width=1350) (actual > time=32.195..32.338 rows=10 loops=1)' > ' -> Nested Loop (cost=4.69..4043.09 rows=482 width=1350) (actual > time=32.190..32.316 rows=10 loops=1)' > ' -> Bitmap Heap Scan on documentversions (cost=4.69..1139.40 > rows=482 width=996) (actual time=32.161..32.171 rows=10 loops=1)' > ' Recheck Cond: (documentstatus = ''AC''::bpchar)' > ' -> Bitmap Index Scan on ix_docstatus (cost=0.00..4.69 > rows=482 width=0) (actual time=31.467..31.467 rows=96368 loops=1)' > ' Index Cond: (documentstatus = ''AC''::bpchar)' > ' -> Index Scan using ix_cdocdid on clinicaldocuments > (cost=0.00..6.01 rows=1 width=354) (actual time=0.006..0.007 rows=1 > loops=10)' > ' Index Cond: ("outer".documentidentifier = > clinicaldocuments.dssdocumentidentifier)' > > > for second query > > QUERY PLAN > 'Hash Join (cost=899.83..4384.17 rows=482 width=1350)' > ' Hash Cond: ("outer".documentidentifier = > "inner".dssdocumentidentifier)' > ' -> Seq Scan on documentversions (cost=0.00..2997.68 rows=96368 > width=996)' > ' -> Hash (cost=898.62..898.62 rows=482 width=354)' > ' -> Bitmap Heap Scan on clinicaldocuments (cost=4.69..898.62 > rows=482 width=354)' > ' Recheck Cond: (patientidentifier = 123)' > ' -> Bitmap Index Scan on ix_cdocpid (cost=0.00..4.69 > rows=482 width=0)' > ' Index Cond: (patientidentifier = 123)' OK, the first one is explain analyze, but the second one is just plain explain...
"Tim Jones" <TJones@optio.com> writes: > QUERY PLAN > 'Hash Join (cost=899.83..4384.17 rows=482 width=1350)' > ' Hash Cond: ("outer".documentidentifier = > "inner".dssdocumentidentifier)' This is not EXPLAIN ANALYZE output. Also, the rowcount estimates seem far enough off in the other query to make me wonder how long it's been since you ANALYZEd the tables... More generally, though, I don't see anything particularly wrong with this query plan. You're selecting enough of the table that an indexscan isn't necessarily a good plan. regards, tom lane
On Fri, 2006-02-10 at 16:43, Tim Jones wrote: > oops > > QUERY PLAN > 'Hash Join (cost=899.83..4384.17 rows=482 width=1350) (actual > time=0.203..0.203 rows=0 loops=1)' > ' Hash Cond: ("outer".documentidentifier = > "inner".dssdocumentidentifier)' > ' -> Seq Scan on documentversions (cost=0.00..2997.68 rows=96368 > width=996) (actual time=0.007..0.007 rows=1 loops=1)' > ' -> Hash (cost=898.62..898.62 rows=482 width=354) (actual > time=0.161..0.161 rows=0 loops=1)' > ' -> Bitmap Heap Scan on clinicaldocuments (cost=4.69..898.62 > rows=482 width=354) (actual time=0.159..0.159 rows=0 loops=1)' > ' Recheck Cond: (patientidentifier = 123)' > ' -> Bitmap Index Scan on ix_cdocpid (cost=0.00..4.69 > rows=482 width=0) (actual time=0.153..0.153 rows=0 loops=1)' > ' Index Cond: (patientidentifier = 123)' > 'Total runtime: 0.392 ms' > > note I have done these on a smaller db than what I am using but the > plans are the same Hmmmm. We really need to see what's happening on the real database to see what's going wrong. i.e. if the real database thinks it'll get 30 rows and it gets back 5,000,000 that's a problem. The query planner in pgsql is cost based, so until you have real data underneath it, and analyze it, you can't really say how it will behave for you. I.e. small test sets don't work.
"Tim Jones" <TJones@optio.com> writes: > QUERY PLAN > 'Hash Join (cost=899.83..4384.17 rows=482 width=1350) (actual > time=0.203..0.203 rows=0 loops=1)' > ... > 'Total runtime: 0.392 ms' Hardly seems like evidence of a performance problem ... regards, tom lane
ok here is real db the first query I had seems to make no sense because it is only fast if I limit the rows since almost all rows have status = 'AC' second query tables both have about 10 million rows and it takes a long time as you can see but this person only has approx 160 total documents QUERY PLAN ------------------------------------------------------------------------ ------------------------------------------------------------------- Hash Join (cost=84813.14..1510711.97 rows=48387 width=555) (actual time=83266.854..91166.315 rows=3 loops=1) Hash Cond: ("outer".documentidentifier = "inner".dssdocumentidentifier) -> Seq Scan on documentversions (cost=0.00..269141.98 rows=9677398 width=415) (actual time=0.056..49812.459 rows=9677398 loops=1) -> Hash (cost=83660.05..83660.05 rows=48036 width=140) (actual time=10.833..10.833 rows=3 loops=1) -> Bitmap Heap Scan on clinicaldocuments (cost=301.13..83660.05 rows=48036 width=140) (actual time=0.243..0.258 rows=3 loops=1) Recheck Cond: (patientidentifier = 690193) -> Bitmap Index Scan on ix_cdocpid (cost=0.00..301.13 rows=48036 width=0) (actual time=0.201..0.201 rows=3 loops=1) Index Cond: (patientidentifier = 690193) Total runtime: 91166.540 ms Tim Jones Healthcare Project Manager Optio Software, Inc. (770) 576-3555 -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Friday, February 10, 2006 5:52 PM To: Tim Jones Cc: Scott Marlowe; Dave Dutcher; pgsql-performance@postgresql.org Subject: Re: [PERFORM] joining two tables slow due to sequential scan "Tim Jones" <TJones@optio.com> writes: > QUERY PLAN > 'Hash Join (cost=899.83..4384.17 rows=482 width=1350) (actual > time=0.203..0.203 rows=0 loops=1)' > ... > 'Total runtime: 0.392 ms' Hardly seems like evidence of a performance problem ... regards, tom lane
OK, if I'm reading this correctly, it looks like the planner is choosing a sequential scan because it expects 48,000 rows for that patientidentifier, but its actually only getting 3. The planner has the number of rows right for the sequential scan, so it seems like the stats are up to date. I would try increasing the stats for the patientindentifier column with 'alter table set statistics...' or increasing the default_statistics_target for the whole DB. Once you have changed the stats I believe you need to run analyze again. -----Original Message----- From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Tim Jones Sent: Friday, February 10, 2006 4:59 PM To: Tom Lane Cc: pgsql-performance@postgresql.org Subject: Re: [PERFORM] joining two tables slow due to sequential scan ok here is real db the first query I had seems to make no sense because it is only fast if I limit the rows since almost all rows have status = 'AC' second query tables both have about 10 million rows and it takes a long time as you can see but this person only has approx 160 total documents QUERY PLAN ------------------------------------------------------------------------ ------------------------------------------------------------------- Hash Join (cost=84813.14..1510711.97 rows=48387 width=555) (actual time=83266.854..91166.315 rows=3 loops=1) Hash Cond: ("outer".documentidentifier = "inner".dssdocumentidentifier) -> Seq Scan on documentversions (cost=0.00..269141.98 rows=9677398 width=415) (actual time=0.056..49812.459 rows=9677398 loops=1) -> Hash (cost=83660.05..83660.05 rows=48036 width=140) (actual time=10.833..10.833 rows=3 loops=1) -> Bitmap Heap Scan on clinicaldocuments (cost=301.13..83660.05 rows=48036 width=140) (actual time=0.243..0.258 rows=3 loops=1) Recheck Cond: (patientidentifier = 690193) -> Bitmap Index Scan on ix_cdocpid (cost=0.00..301.13 rows=48036 width=0) (actual time=0.201..0.201 rows=3 loops=1) Index Cond: (patientidentifier = 690193) Total runtime: 91166.540 ms Tim Jones Healthcare Project Manager Optio Software, Inc. (770) 576-3555 -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Friday, February 10, 2006 5:52 PM To: Tim Jones Cc: Scott Marlowe; Dave Dutcher; pgsql-performance@postgresql.org Subject: Re: [PERFORM] joining two tables slow due to sequential scan "Tim Jones" <TJones@optio.com> writes: > QUERY PLAN > 'Hash Join (cost=899.83..4384.17 rows=482 width=1350) (actual > time=0.203..0.203 rows=0 loops=1)' > ... > 'Total runtime: 0.392 ms' Hardly seems like evidence of a performance problem ... regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings
ok I am retarded :) Apparently I thought I had done analyze on these tables but I actually had not and that was all that was needed. but thanks for the help. Tim Jones Healthcare Project Manager Optio Software, Inc. (770) 576-3555 -----Original Message----- From: Dave Dutcher [mailto:dave@tridecap.com] Sent: Friday, February 10, 2006 6:25 PM To: Tim Jones Cc: pgsql-performance@postgresql.org Subject: RE: [PERFORM] joining two tables slow due to sequential scan OK, if I'm reading this correctly, it looks like the planner is choosing a sequential scan because it expects 48,000 rows for that patientidentifier, but its actually only getting 3. The planner has the number of rows right for the sequential scan, so it seems like the stats are up to date. I would try increasing the stats for the patientindentifier column with 'alter table set statistics...' or increasing the default_statistics_target for the whole DB. Once you have changed the stats I believe you need to run analyze again. -----Original Message----- From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Tim Jones Sent: Friday, February 10, 2006 4:59 PM To: Tom Lane Cc: pgsql-performance@postgresql.org Subject: Re: [PERFORM] joining two tables slow due to sequential scan ok here is real db the first query I had seems to make no sense because it is only fast if I limit the rows since almost all rows have status = 'AC' second query tables both have about 10 million rows and it takes a long time as you can see but this person only has approx 160 total documents QUERY PLAN ------------------------------------------------------------------------ ------------------------------------------------------------------- Hash Join (cost=84813.14..1510711.97 rows=48387 width=555) (actual time=83266.854..91166.315 rows=3 loops=1) Hash Cond: ("outer".documentidentifier = "inner".dssdocumentidentifier) -> Seq Scan on documentversions (cost=0.00..269141.98 rows=9677398 width=415) (actual time=0.056..49812.459 rows=9677398 loops=1) -> Hash (cost=83660.05..83660.05 rows=48036 width=140) (actual time=10.833..10.833 rows=3 loops=1) -> Bitmap Heap Scan on clinicaldocuments (cost=301.13..83660.05 rows=48036 width=140) (actual time=0.243..0.258 rows=3 loops=1) Recheck Cond: (patientidentifier = 690193) -> Bitmap Index Scan on ix_cdocpid (cost=0.00..301.13 rows=48036 width=0) (actual time=0.201..0.201 rows=3 loops=1) Index Cond: (patientidentifier = 690193) Total runtime: 91166.540 ms Tim Jones Healthcare Project Manager Optio Software, Inc. (770) 576-3555 -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Friday, February 10, 2006 5:52 PM To: Tim Jones Cc: Scott Marlowe; Dave Dutcher; pgsql-performance@postgresql.org Subject: Re: [PERFORM] joining two tables slow due to sequential scan "Tim Jones" <TJones@optio.com> writes: > QUERY PLAN > 'Hash Join (cost=899.83..4384.17 rows=482 width=1350) (actual > time=0.203..0.203 rows=0 loops=1)' > ... > 'Total runtime: 0.392 ms' Hardly seems like evidence of a performance problem ... regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings