Re: How to improve speed of 3 table join &group (HUGE tables) - Mailing list pgsql-performance

From John Major
Subject Re: How to improve speed of 3 table join &group (HUGE tables)
Date
Msg-id 4717B80B.6050706@cbio.mskcc.org
In response to Re: How to improve speed of 3 table join &group (HUGE tables)  (Nis Jørgensen <nis@superlativ.dk>)
Responses Re: How to improve speed of 3 table join &group (HUGE tables)
List pgsql-performance
Hello Nis-

I did reset the defaults before running the explain.

Primary keys for the tables.
  sequence_fragment.seq_frag_id
  sequence.sequence_id

Candidate keys.
  fragment_external_info.seq_frag_id (FK to sequence_fragment.seq_frag_id)
  sequence_alignment.sequence_id (FK to sequence_fragment.sequence_id).

None of the fields are nullable.

sequence_fragment is the anchor table.
  seq_frag_id is the primary key (and foreign key to
fragment_external_info): ~4.5 million unique entries
  sequence_id is an indexed field: ~3 million distinct IDs

sequence_alignment has 500 million entries, but I join on sequence_id,
which has ~3 million distinct values.
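
Nis asked below how many distinct sequence_id values exist in
sequence_alignment. For anyone following along, a sketch of how the ~3
million figure can be checked, and compared against the planner's own
statistics (table and column names as described in this thread):

```sql
-- Exact distinct count (a full pass over 500M rows, so slow, but definitive):
SELECT count(DISTINCT sequence_id) FROM sequence_alignment;

-- The planner's estimate, from statistics gathered by ANALYZE.
-- A negative n_distinct means a fraction of the total row count:
SELECT n_distinct
FROM pg_stats
WHERE tablename = 'sequence_alignment'
  AND attname = 'sequence_id';
```

If the pg_stats estimate is far off the true count, raising the
statistics target on that column and re-running ANALYZE may improve the
plan.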


When I make the suggested changes, the new query is:

  select sf.library_id, fio.clip_type, count(sf.sequence_id)
  from sequence_fragment sf, fragment_external_info fio
  where sf.seq_frag_id = fio.sequence_frag_id
    and sf.sequence_id IN
      (SELECT sequence_id from sequence_alignment)
  group by sf.library_id, fio.clip_type

After making the two changes, the cost dropped dramatically, but it is
still very high.
Original Explain cost:
cost=1308049564..1345206293 rows=54 width=16

New Explain cost:
cost=11831119..11831120 rows=54 width=16
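
For comparison, the IN subquery can also be phrased as a correlated
EXISTS, which the planner sometimes handles differently; this is a
sketch against the same schema, not something I have timed:

```sql
select sf.library_id, fio.clip_type, count(sf.sequence_id)
from sequence_fragment sf
join fragment_external_info fio
  on sf.seq_frag_id = fio.sequence_frag_id
where exists
  (select 1
   from sequence_alignment sa
   where sa.sequence_id = sf.sequence_id)
group by sf.library_id, fio.clip_type;
```

Note also that the costs above are only planner estimates; running the
query under EXPLAIN ANALYZE would show the actual row counts and
timings for each node.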

John




Nis Jørgensen wrote:
> John Major skrev:
>
>> I am trying to join three quite large tables, and the query is
>> unbearably slow (meaning I can't get results in more than a day of
>> processing).
>> I've tried the basic optimizations I understand, and nothing has
>> improved the execution speed... any help with this would be greatly
>> appreciated.
>>
>>
>> The three tables are quite large:
>>  sequence_fragment = 4.5 million rows
>>  sequence_external_info = 10 million rows
>>  sequence_alignment = 500 million rows
>>
>>
>> The query I am attempting to run is this:
>>
>> select sf.library_id, fio.clip_type , count(distinct(sa.sequence_id))
>> from   sequence_alignment sa,  sequence_fragment sf,
>> fragment_external_info fio
>> where sf.seq_frag_id = fio.sequence_frag_id
>> and  sf.sequence_id = sa.sequence_id
>> group by sf.library_id, fio.clip_type
>>
>>
>> NOTES:
>> ~there are indexes on all of the fields being joined (but not on
>> library_id or clip_type).
>> ~Everything has been re-analyzed post index creation.
>>
>
> What are the primary (and candidate) keys of the tables? Are any of the
> fields nullable? How many distinct values exist for
> sequence_alignment.sequence_id?
>
>
>> ~I've tried "set enable_seqscan=off" and set (join_table_order or
>> something) = 1
>>
>
>
> It would help if you turned the settings back to defaults before doing
> the ANALYZE - or provide the results of that case as well.
>
>
>> The explain plan is as follows:
>>
>
> [cut]
>
> Without trying to understand the ANALYZE output, I would suggest two
> possible optimizations:
>
> - Using count(distinct(sf.sequence_id)) instead of
> count(distinct(sa.sequence_id)).
>
> - Replacing the join to sequence_alignment with "WHERE sf.sequence_id IN
> (SELECT sequence_id from sequence_alignment)".
>
> The first one probably won't help (nor hurt), but the second one might
> be able to get rid of the table scan, or at least the need to do the
> full merge join (which returns an estimated 3 billion rows).
>
> Hope this helps,
>
> Nis
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>

