Re: How to improve speed of 3 table join & group (HUGE tables) - Mailing list pgsql-performance
From: John Major
Subject: Re: How to improve speed of 3 table join & group (HUGE tables)
Msg-id: 4717B80B.6050706@cbio.mskcc.org
In response to: Re: How to improve speed of 3 table join & group (HUGE tables) (Nis Jørgensen <nis@superlativ.dk>)
Responses: Re: How to improve speed of 3 table join & group (HUGE tables)
List: pgsql-performance
Hello Nis-

I did reset the defaults before running the explain.

Primary keys for the tables:
  sequence_fragment.seq_frag_id
  sequence.sequence_id

Candidate keys:
  fragment_external_info.seq_frag_id (FK to sequence_fragment.seq_frag_id)
  sequence_alignment.sequence_id (FK to sequence_fragment.sequence_id)

None of the fields are nullable.

sequence_fragment is the anchor table: seq_frag_id is the primary key (and foreign key to fragment_external_info), with ~4.5 million unique entries; sequence_id is an indexed field with ~3 million distinct IDs.

sequence_alignment has 500 million entries, but I join on sequence_id, which has ~3 million distinct values.

With the suggested changes, the new query is:

select sf.library_id, fio.clip_type, count(sf.sequence_id)
from sequence_fragment sf, fragment_external_info fio
where sf.seq_frag_id = fio.sequence_frag_id
and sf.sequence_id IN (SELECT sequence_id from sequence_alignment)
group by sf.library_id, fio.clip_type

After making the two changes, the cost dropped dramatically, but is still very high:

Original explain cost: cost=1308049564..1345206293 rows=54 width=16
New explain cost:      cost=11831119..11831120 rows=54 width=16

John

Nis Jørgensen wrote:
> John Major skrev:
>
>> I am trying to join three quite large tables, and the query is
>> unbearably slow (meaning I can't get results in more than a day of
>> processing).
>> I've tried the basic optimizations I understand, and nothing has
>> improved the execute speed... any help with this would be greatly
>> appreciated.
>>
>> The three tables are quite large:
>> sequence_fragment = 4.5 million rows
>> sequence_external_info = 10 million rows
>> sequence_alignment = 500 million rows
>>
>> The query I am attempting to run is this:
>>
>> select sf.library_id, fio.clip_type, count(distinct(sa.sequence_id))
>> from sequence_alignment sa, sequence_fragment sf,
>>      fragment_external_info fio
>> where sf.seq_frag_id = fio.sequence_frag_id
>> and sf.sequence_id = sa.sequence_id
>> group by sf.library_id, fio.clip_type
>>
>> NOTES:
>> ~ There are indexes on all of the fields being joined (but not on
>>   library_id or clip_type).
>> ~ Everything has been re-analyzed post index creation.
>
> What are the primary (and candidate) keys of the tables? Are any of the
> fields nullable? How many distinct values exist for
> sequence_alignment.sequence_id?
>
>> ~ I've tried "set enable_seqscan=off" and set (join_table_order or
>>   something) = 1
>
> It would help if you turned the settings back to defaults before doing
> the ANALYZE - or provide the results of that case as well.
>
>> The explain plan is as follows:
>
> [cut]
>
> Without trying to understand the ANALYZE output, I would suggest two
> possible optimizations:
>
> - Using count(distinct(sf.sequence_id)) instead of
>   count(distinct(sa.sequence_id)).
>
> - Replacing the join to sequence_alignment with "WHERE sf.sequence_id IN
>   (SELECT sequence_id from sequence_alignment)".
>
> The first one probably won't help (nor hurt), but the second one might
> be able to get rid of the table scan, or at least the need to do the full
> merge join (which returns an estimated 3 billion rows).
>
> Hope this helps,
>
> Nis
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
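[Editor's note: Nis's question about the number of distinct values in sequence_alignment.sequence_id can be estimated cheaply from the planner's statistics rather than by scanning 500 million rows. A minimal sketch, assuming ANALYZE has already been run on the table:]

```sql
-- Sketch only: read the planner's estimate instead of counting directly.
-- Assumes ANALYZE has been run on sequence_alignment.
-- Note: a negative n_distinct means "-(fraction of total rows)".
SELECT attname, n_distinct
FROM pg_stats
WHERE tablename = 'sequence_alignment'
  AND attname = 'sequence_id';

-- The exact (but very expensive on 500M rows) equivalent, for comparison:
-- SELECT count(DISTINCT sequence_id) FROM sequence_alignment;
```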