Re: How to improve speed of 3 table join &group (HUGE tables) - Mailing list pgsql-performance

From Nis Jørgensen
Subject Re: How to improve speed of 3 table join &group (HUGE tables)
Date
Msg-id ff8ace$t7e$1@ger.gmane.org
Whole thread Raw
In response to How to improve speed of 3 table join &group (HUGE tables)  (John Major <major@cbio.mskcc.org>)
Responses Re: How to improve speed of 3 table join &group (HUGE tables)  (John Major <major@cbio.mskcc.org>)
List pgsql-performance
John Major skrev:
> I am trying to join three quite large tables, and the query is
> unbearably slow(meaning I can't get results in more than a day of
> processing).
> I've tried the basic optimizations I understand, and nothing has
> improved the execute speed.... any help with this would be greatly
> appreciated
>
>
> The three tables are quite large:
>  sequence_fragment = 4.5 million rows
>  sequence_external_info = 10million rows
>  sequence_alignment = 500 million rows
>
>
> The query I am attempting to run is this:
>
> select sf.library_id, fio.clip_type , count(distinct(sa.sequence_id))
> from   sequence_alignment sa,  sequence_fragment sf,
> fragment_external_info fio
> where sf.seq_frag_id = fio.sequence_frag_id
> and  sf.sequence_id = sa.sequence_id
> group by sf.library_id, fio.clip_type
>
>
> NOTES:
> ~there are indexes on all of the fields being joined (but not on
> library_id or clip_type ). ~Everything has been re-analyzed post index
> creation

What are the primary (and candidate) keys of the tables? Are any of the
fields nullable? How many distinct values exist for
sequence_alignment.sequence_id?

> ~I've tried "set enable_seqscan=off" and set (join_table_order or
> something) = 1


It would help if you turned the settings back to defaults before doing
the ANALYZE - or provide the results of that case as well.

> The explain plan is as follows:

[cut]

Without trying to understand the ANALYZE output, I would suggest two
possible optimizations:

- Using count(distinct(sf.sequence_id)) instead of
count(distinct(sa.sequence_id)).

- Replacing the join to sequence_alignment with "WHERE sf.sequence_id IN
(SELECT sequence_id from sequence_alignment)".

The first one probably won't help (nor hurt), but the second one might
be able to get rid of the table scan, or at least the need do the full
merge join (which returns an estimated 3 billion rows).

Hope this helps,

Nis

pgsql-performance by date:

Previous
From: John Major
Date:
Subject: How to improve speed of 3 table join &group (HUGE tables)
Next
From: "Heikki Linnakangas"
Date:
Subject: Re: How to improve speed of 3 table join &group (HUGE tables)