Thread: How to improve speed of 3 table join &group (HUGE tables)
I am trying to join three quite large tables, and the query is unbearably slow(meaning I can't get results in more than a day of processing). I've tried the basic optimizations I understand, and nothing has improved the execute speed.... any help with this would be greatly appreciated The three tables are quite large: sequence_fragment = 4.5 million rows sequence_external_info = 10million rows sequence_alignment = 500 million rows The query I am attempting to run is this: select sf.library_id, fio.clip_type , count(distinct(sa.sequence_id)) from sequence_alignment sa, sequence_fragment sf, fragment_external_info fio where sf.seq_frag_id = fio.sequence_frag_id and sf.sequence_id = sa.sequence_id group by sf.library_id, fio.clip_type NOTES: ~there are indexes on all of the fields being joined (but not on library_id or clip_type ). ~Everything has been re-analyzed post index creation ~I've tried "set enable_seqscan=off" and set (join_table_order or something) = 1 The explain plan is as follows: QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------- GroupAggregate (cost=1443436673.93..1480593403.29 rows=54 width=16) -> Sort (cost=1443436673.93..1452725856.10 rows=3715672868 width=16) Sort Key: sf.library_id, fio.clip_type -> Merge Join (cost=263624049.25..319410068.18 rows=3715672868 width=16) Merge Cond: (sf.sequence_id = sa.sequence_id) -> Sort (cost=38102888.77..38128373.54 rows=10193906 width=16) Sort Key: sf.sequence_id -> Hash Join (cost=5305576.14..36080036.76 rows=10193906 width=16) Hash Cond: (fio.sequence_frag_id = sf.seq_frag_id) -> Index Scan using frag_ext_info_seq_frag_id on fragment_external_info fio (cost=0.00..30450510.27 rows=10193906 width=12) -> Hash (cost=5223807.54..5223807.54 rows=4453728 width=12) -> Index Scan using seq_frag_seqid_ind on sequence_fragment sf (cost=0.00..5223807.54 rows=4453728 width=12) -> Sort (cost=225521160.48..226688766.88 rows=467042560 width=4) Sort Key: sa.sequence_id -> Seq Scan on sequence_alignment sa (cost=100000000.00..110379294.60 rows=467042560 width=4) 15 record(s) selected [Fetch MetaData: 0/ms] [Fetch Data: 0/ms] Thanks in advance! John Major
John Major skrev: > I am trying to join three quite large tables, and the query is > unbearably slow(meaning I can't get results in more than a day of > processing). > I've tried the basic optimizations I understand, and nothing has > improved the execute speed.... any help with this would be greatly > appreciated > > > The three tables are quite large: > sequence_fragment = 4.5 million rows > sequence_external_info = 10million rows > sequence_alignment = 500 million rows > > > The query I am attempting to run is this: > > select sf.library_id, fio.clip_type , count(distinct(sa.sequence_id)) > from sequence_alignment sa, sequence_fragment sf, > fragment_external_info fio > where sf.seq_frag_id = fio.sequence_frag_id > and sf.sequence_id = sa.sequence_id > group by sf.library_id, fio.clip_type > > > NOTES: > ~there are indexes on all of the fields being joined (but not on > library_id or clip_type ). ~Everything has been re-analyzed post index > creation What are the primary (and candidate) keys of the tables? Are any of the fields nullable? How many distinct values exist for sequence_alignment.sequence_id? > ~I've tried "set enable_seqscan=off" and set (join_table_order or > something) = 1 It would help if you turned the settings back to defaults before doing the ANALYZE - or provide the results of that case as well. > The explain plan is as follows: [cut] Without trying to understand the ANALYZE output, I would suggest two possible optimizations: - Using count(distinct(sf.sequence_id)) instead of count(distinct(sa.sequence_id)). - Replacing the join to sequence_alignment with "WHERE sf.sequence_id IN (SELECT sequence_id from sequence_alignment)". The first one probably won't help (nor hurt), but the second one might be able to get rid of the table scan, or at least the need do the full merge join (which returns an estimated 3 billion rows). Hope this helps, Nis
John Major wrote: > ~there are indexes on all of the fields being joined (but not on > library_id or clip_type ). ~Everything has been re-analyzed post index > creation > ~I've tried "set enable_seqscan=off" and set (join_table_order or > something) = 1 Seqscanning and sorting a table is generally faster than a full scan of the table using an index scan, unless the heap is roughly in the index order. You probably need to CLUSTER the tables to use the indexes effectively. Are you sure you have an index on sequence_alignment.sequence_id? The planner seems to choose a seqscan + sort, even though you've set enable_seqscan=false. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Hello Nis- I did reset the defaults before running the explain. Primary keys for the tables. sequence_fragment.seq_frag_id sequence.sequence_id Candidate keys. fragment_external_info.seq_frag_id (FK to sequence_fragment.seq_frag_id) sequence_alignment.sequence_id (FK to sequence_fragment.sequence_id). None of the fields are nullable. sequence is the anchor table. seq_frag_id is the primary key (and foreign key to fragment_external_info) ~4.5 million unique entries sequence_id is an indexed field. ~3 million distinct IDs sequence_alignment has 500million entries, but i join on sequence_id which has ~3million entries. When I make the suggested changes, the new query is: select sf.library_id, fio.clip_type , count(sf.sequence_id) from sequence_fragment sf, fragment_external_info fio where sf.seq_frag_id = fio.sequence_frag_id and sf.sequence_id IN (SELECT sequence_id from sequence_alignment) group by sf.library_id, fio.clip_type After making the 2 changes, the cost dropped dramatically... but is still very high. Original Explain cost: cost=1308049564..1345206293 rows=54 width=16 New Explain cost: cost=11831119..11831120 rows=54 width=16 John Nis Jørgensen wrote: > John Major skrev: > >> I am trying to join three quite large tables, and the query is >> unbearably slow(meaning I can't get results in more than a day of >> processing). >> I've tried the basic optimizations I understand, and nothing has >> improved the execute speed.... any help with this would be greatly >> appreciated >> >> >> The three tables are quite large: >> sequence_fragment = 4.5 million rows >> sequence_external_info = 10million rows >> sequence_alignment = 500 million rows >> >> >> The query I am attempting to run is this: >> >> select sf.library_id, fio.clip_type , count(distinct(sa.sequence_id)) >> from sequence_alignment sa, sequence_fragment sf, >> fragment_external_info fio >> where sf.seq_frag_id = fio.sequence_frag_id >> and sf.sequence_id = sa.sequence_id >> group by sf.library_id, fio.clip_type >> >> >> NOTES: >> ~there are indexes on all of the fields being joined (but not on >> library_id or clip_type ). ~Everything has been re-analyzed post index >> creation >> > > What are the primary (and candidate) keys of the tables? Are any of the > fields nullable? How many distinct values exist for > sequence_alignment.sequence_id? > > >> ~I've tried "set enable_seqscan=off" and set (join_table_order or >> something) = 1 >> > > > It would help if you turned the settings back to defaults before doing > the ANALYZE - or provide the results of that case as well. > > >> The explain plan is as follows: >> > > [cut] > > Without trying to understand the ANALYZE output, I would suggest two > possible optimizations: > > - Using count(distinct(sf.sequence_id)) instead of > count(distinct(sa.sequence_id)). > > - Replacing the join to sequence_alignment with "WHERE sf.sequence_id IN > (SELECT sequence_id from sequence_alignment)". > > The first one probably won't help (nor hurt), but the second one might > be able to get rid of the table scan, or at least the need do the full > merge join (which returns an estimated 3 billion rows). > > Hope this helps, > > Nis > > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend >
Hi Hekki- When I turn seq_scan off for the new query: explain select sf.library_id, fio.clip_type , count(sf.sequence_id) from sequence_fragment sf, fragment_external_info fio where sf.seq_frag_id = fio.sequence_frag_id and sf.sequence_id IN (SELECT sequence_id from sequence_alignment) group by sf.library_id, fio.clip_type The index is used... but the cost gets worse! it goes from: 11831119 -TO- 53654888 Actually... The new query executes in ~ 15 minutes... which is good enough for me for now. Thanks Nis! john Heikki Linnakangas wrote: > John Major wrote: > >> ~there are indexes on all of the fields being joined (but not on >> library_id or clip_type ). ~Everything has been re-analyzed post index >> creation >> ~I've tried "set enable_seqscan=off" and set (join_table_order or >> something) = 1 >> > > Seqscanning and sorting a table is generally faster than a full scan of > the table using an index scan, unless the heap is roughly in the index > order. You probably need to CLUSTER the tables to use the indexes > effectively. > > Are you sure you have an index on sequence_alignment.sequence_id? The > planner seems to choose a seqscan + sort, even though you've set > enable_seqscan=false. > >
Hi, how about: select sf.library_id, fio.clip_type , count(sf.sequence_id) from sequence_fragment sf, fragment_external_info fio ,(SELECT distinct sequence_id from sequence_alignment) sa where sf.seq_frag_id = fio.sequence_frag_id and sf.sequence_id = sa.sequence_id group by sf.library_id, fio.clip_type I don't know postgres well, but I would put my bet in Oracle in that derived table instead of that in clause. Ismo On Thu, 18 Oct 2007, John Major wrote: > Hi Hekki- > > When I turn seq_scan off for the new query: > > explain > select sf.library_id, fio.clip_type , count(sf.sequence_id) > from sequence_fragment sf, fragment_external_info fio > where sf.seq_frag_id = fio.sequence_frag_id > and sf.sequence_id IN > (SELECT sequence_id from sequence_alignment) > group by sf.library_id, fio.clip_type > > The index is used... but the cost gets worse! > it goes from: > 11831119 > -TO- > 53654888 > > Actually... The new query executes in ~ 15 minutes... which is good enough for > me for now. > > Thanks Nis! > > john > > > > Heikki Linnakangas wrote: > > John Major wrote: > > > > > ~there are indexes on all of the fields being joined (but not on > > > library_id or clip_type ). ~Everything has been re-analyzed post index > > > creation > > > ~I've tried "set enable_seqscan=off" and set (join_table_order or > > > something) = 1 > > > > > > > Seqscanning and sorting a table is generally faster than a full scan of > > the table using an index scan, unless the heap is roughly in the index > > order. You probably need to CLUSTER the tables to use the indexes > > effectively. > > > > Are you sure you have an index on sequence_alignment.sequence_id? The > > planner seems to choose a seqscan + sort, even though you've set > > enable_seqscan=false. > > > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 5: don't forget to increase your free space map settings > > >
John Major skrev: > Hello Nis- > > I did reset the defaults before running the explain. This line from your original post: -> Seq Scan on sequence_alignment sa (cost=100000000.00..110379294.60 rows=467042560 width=4) Is an indication that you didn't (AFAIK enable_seqscan=off works by setting the cost of starting a seqscan to 100000000). > Candidate keys. > fragment_external_info.seq_frag_id (FK to sequence_fragment.seq_frag_id) > sequence_alignment.sequence_id (FK to sequence_fragment.sequence_id). Those are not candidate keys. A candidate key is "something which could have been chosen as the primary key". Anyway, I think I understand your table layout now. It might have been quicker if you just posted the definition of your tables. This could also have shown us that the correct indexes are in place, rather than taking your word for it. You are absolutely certain that both sides of all FK relationships are indexed? > After making the 2 changes, the cost dropped dramatically... but is still very high. > Original Explain cost: > cost=1308049564..1345206293 rows=54 width=16 > > New Explain cost: > cost=11831119..11831120 rows=54 width=16 Please post the full output if you want more help. And preferably use EXPLAIN ANALYZE, now that it runs in finite time. Nis