Re: How to improve speed of 3 table join &group (HUGE tables) - Mailing list pgsql-performance

From	Nis Jørgensen
Subject	Re: How to improve speed of 3 table join &group (HUGE tables)
Date	October 18, 2007 18:58:50
Msg-id	ff8ace$t7e$1@ger.gmane.org
In response to	How to improve speed of 3 table join &group (HUGE tables) (John Major <major@cbio.mskcc.org>)
Responses	Re: How to improve speed of 3 table join &group (HUGE tables)
List	pgsql-performance

John Major wrote:
> I am trying to join three quite large tables, and the query is
> unbearably slow (meaning I can't get results in more than a day of
> processing).
> I've tried the basic optimizations I understand, and nothing has
> improved the execute speed.... any help with this would be greatly
> appreciated
>
>
> The three tables are quite large:
>  sequence_fragment = 4.5 million rows
>  sequence_external_info = 10 million rows
>  sequence_alignment = 500 million rows
>
>
> The query I am attempting to run is this:
>
> select sf.library_id, fio.clip_type , count(distinct(sa.sequence_id))
> from   sequence_alignment sa,  sequence_fragment sf,
> fragment_external_info fio
> where sf.seq_frag_id = fio.sequence_frag_id
> and  sf.sequence_id = sa.sequence_id
> group by sf.library_id, fio.clip_type
>
>
> NOTES:
> ~there are indexes on all of the fields being joined (but not on
> library_id or clip_type ). ~Everything has been re-analyzed post index
> creation

What are the primary (and candidate) keys of the tables? Are any of the
fields nullable? How many distinct values exist for
sequence_alignment.sequence_id?
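
A rough sketch of how those questions could be answered, assuming the
table and column names from the quoted query and that the tables have
been analyzed recently. The pg_stats figures are planner estimates, not
exact counts.

```sql
-- Estimated distinct values and NULL fraction for the join columns,
-- taken from the statistics gathered by the last ANALYZE (cheap to run).
-- A negative n_distinct is a fraction of the row count, not a count.
SELECT tablename, attname, n_distinct, null_frac
FROM   pg_stats
WHERE  tablename IN ('sequence_alignment',
                     'sequence_fragment',
                     'fragment_external_info')
AND    attname   IN ('sequence_id', 'seq_frag_id', 'sequence_frag_id');

-- Exact figure for the big table. This scans all of
-- sequence_alignment, so expect it to take a while.
SELECT count(DISTINCT sequence_id) FROM sequence_alignment;

-- Keys and NOT NULL constraints show up in psql with:
--   \d sequence_fragment
```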

> ~I've tried "set enable_seqscan=off" and set (join_table_order or
> something) = 1


It would help if you turned the settings back to their defaults before
doing the ANALYZE - or provided the results of that case as well.
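
Something like the following would give the default-settings plan. This
is only a sketch: RESET ALL puts every session-level setting back to its
configured default, and join_collapse_limit is my guess at the setting
meant by "join_table_order or something".

```sql
-- Put all session-level settings back to their defaults ...
RESET ALL;

-- ... or reset only the ones that were changed.
RESET enable_seqscan;
RESET join_collapse_limit;  -- assumption: the setting referred to above

-- Plain EXPLAIN shows the plan without running the query.
EXPLAIN
SELECT sf.library_id, fio.clip_type, count(DISTINCT sa.sequence_id)
FROM   sequence_alignment sa,
       sequence_fragment sf,
       fragment_external_info fio
WHERE  sf.seq_frag_id = fio.sequence_frag_id
AND    sf.sequence_id = sa.sequence_id
GROUP  BY sf.library_id, fio.clip_type;
```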

> The explain plan is as follows:

[cut]

Without trying to understand the EXPLAIN output, I would suggest two
possible optimizations:

- Using count(distinct(sf.sequence_id)) instead of
count(distinct(sa.sequence_id)).

- Replacing the join to sequence_alignment with "WHERE sf.sequence_id IN
(SELECT sequence_id from sequence_alignment)".

The first one probably won't help (nor hurt), but the second one might
be able to get rid of the table scan, or at least the need to do the full
merge join (which returns an estimated 3 billion rows).
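
With both changes applied, the query would look roughly like this. It is
untested against the real schema and uses the table and column names
from the quoted query. Since the join condition makes sf.sequence_id
equal to sa.sequence_id and the count is over distinct values, the
result should be the same as the original.

```sql
SELECT sf.library_id,
       fio.clip_type,
       count(DISTINCT sf.sequence_id)   -- was: sa.sequence_id
FROM   sequence_fragment sf,
       fragment_external_info fio
WHERE  sf.seq_frag_id = fio.sequence_frag_id
-- Only test for the existence of a matching alignment, instead of
-- joining to every matching row in the 500 million row table.
AND    sf.sequence_id IN (SELECT sequence_id FROM sequence_alignment)
GROUP  BY sf.library_id, fio.clip_type;
```

Running EXPLAIN on this version and comparing it with the original plan
should show whether the planner actually avoids the large merge join.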

Hope this helps,

Nis
