Thread: How to improve speed of 3 table join &group (HUGE tables)

How to improve speed of 3 table join &group (HUGE tables)

From

John Major

Date:

18 October 2007, 14:21:36

I am trying to join three quite large tables, and the query is
unbearably slow(meaning I can't get results in more than a day of
processing).
I've tried the basic optimizations I understand, and nothing has
improved the execute speed.... any help with this would be greatly
appreciated


The three tables are quite large:
  sequence_fragment = 4.5 million rows
  sequence_external_info = 10million rows
  sequence_alignment = 500 million rows


The query I am attempting to run is this:

select sf.library_id, fio.clip_type , count(distinct(sa.sequence_id))
from   sequence_alignment sa,  sequence_fragment sf,
fragment_external_info fio
where sf.seq_frag_id = fio.sequence_frag_id
and  sf.sequence_id = sa.sequence_id
group by sf.library_id, fio.clip_type


NOTES:
~there are indexes on all of the fields being joined (but not on
library_id or clip_type ).
~Everything has been re-analyzed post index creation
~I've tried "set enable_seqscan=off" and set (join_table_order or
something) = 1

The explain plan is as follows:

 QUERY
PLAN
                          


-------------------------------------------------------------------------------------------------------------------------------------------------------


 GroupAggregate  (cost=1443436673.93..1480593403.29 rows=54
width=16)

   ->  Sort  (cost=1443436673.93..1452725856.10 rows=3715672868
width=16)

         Sort Key: sf.library_id,
fio.clip_type

         ->  Merge Join  (cost=263624049.25..319410068.18
rows=3715672868
width=16)

               Merge Cond: (sf.sequence_id =
sa.sequence_id)

               ->  Sort  (cost=38102888.77..38128373.54 rows=10193906
width=16)

                     Sort Key:
sf.sequence_id
  

                     ->  Hash Join  (cost=5305576.14..36080036.76
rows=10193906
width=16)
                           Hash Cond: (fio.sequence_frag_id =
sf.seq_frag_id)

                           ->  Index Scan using
frag_ext_info_seq_frag_id on fragment_external_info fio
(cost=0.00..30450510.27 rows=10193906 width=12)
                           ->  Hash  (cost=5223807.54..5223807.54
rows=4453728
width=12)
                                 ->  Index Scan using seq_frag_seqid_ind
on sequence_fragment sf  (cost=0.00..5223807.54 rows=4453728
width=12)
               ->  Sort  (cost=225521160.48..226688766.88 rows=467042560
width=4)

                     Sort Key:
sa.sequence_id
  

                     ->  Seq Scan on sequence_alignment sa
(cost=100000000.00..110379294.60 rows=467042560
width=4)

 15 record(s) selected [Fetch MetaData: 0/ms] [Fetch Data: 0/ms]


Thanks in advance!
John Major

Re: How to improve speed of 3 table join &group (HUGE tables)

From

Nis Jørgensen

Date:

18 October 2007, 15:58:50

John Major skrev:
> I am trying to join three quite large tables, and the query is
> unbearably slow(meaning I can't get results in more than a day of
> processing).
> I've tried the basic optimizations I understand, and nothing has
> improved the execute speed.... any help with this would be greatly
> appreciated
>
>
> The three tables are quite large:
>  sequence_fragment = 4.5 million rows
>  sequence_external_info = 10million rows
>  sequence_alignment = 500 million rows
>
>
> The query I am attempting to run is this:
>
> select sf.library_id, fio.clip_type , count(distinct(sa.sequence_id))
> from   sequence_alignment sa,  sequence_fragment sf,
> fragment_external_info fio
> where sf.seq_frag_id = fio.sequence_frag_id
> and  sf.sequence_id = sa.sequence_id
> group by sf.library_id, fio.clip_type
>
>
> NOTES:
> ~there are indexes on all of the fields being joined (but not on
> library_id or clip_type ). ~Everything has been re-analyzed post index
> creation

What are the primary (and candidate) keys of the tables? Are any of the
fields nullable? How many distinct values exist for
sequence_alignment.sequence_id?

> ~I've tried "set enable_seqscan=off" and set (join_table_order or
> something) = 1


It would help if you turned the settings back to defaults before doing
the ANALYZE - or provide the results of that case as well.

> The explain plan is as follows:

[cut]

Without trying to understand the ANALYZE output, I would suggest two
possible optimizations:

- Using count(distinct(sf.sequence_id)) instead of
count(distinct(sa.sequence_id)).

- Replacing the join to sequence_alignment with "WHERE sf.sequence_id IN
(SELECT sequence_id from sequence_alignment)".

The first one probably won't help (nor hurt), but the second one might
be able to get rid of the table scan, or at least the need do the full
merge join (which returns an estimated 3 billion rows).

Hope this helps,

Nis

Re: How to improve speed of 3 table join &group (HUGE tables)

From

"Heikki Linnakangas"

Date:

18 October 2007, 16:07:25

John Major wrote:
> ~there are indexes on all of the fields being joined (but not on
> library_id or clip_type ). ~Everything has been re-analyzed post index
> creation
> ~I've tried "set enable_seqscan=off" and set (join_table_order or
> something) = 1

Seqscanning and sorting a table is generally faster than a full scan of
the table using an index scan, unless the heap is roughly in the index
order. You probably need to CLUSTER the tables to use the indexes
effectively.

Are you sure you have an index on sequence_alignment.sequence_id? The
planner seems to choose a seqscan + sort, even though you've set
enable_seqscan=false.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: How to improve speed of 3 table join &group (HUGE tables)

From

John Major

Date:

18 October 2007, 16:47:12

Hello Nis-

I did reset the defaults before running the explain.

Primary keys for the tables.
  sequence_fragment.seq_frag_id
  sequence.sequence_id

Candidate keys.
  fragment_external_info.seq_frag_id (FK to sequence_fragment.seq_frag_id)
  sequence_alignment.sequence_id (FK to sequence_fragment.sequence_id).

None of the fields are nullable.

sequence is the anchor table.
  seq_frag_id is the primary key (and foreign key to
fragment_external_info) ~4.5 million unique entries
  sequence_id is an indexed field. ~3 million distinct IDs

sequence_alignment has 500million entries, but i join on sequence_id
which has ~3million entries.


When I make the suggested changes, the new query is:
  select sf.library_id, fio.clip_type , count(sf.sequence_id)
    from    sequence_fragment sf, fragment_external_info fio
    where sf.seq_frag_id = fio.sequence_frag_id
    and  sf.sequence_id IN
        (SELECT sequence_id from sequence_alignment)
    group by sf.library_id, fio.clip_type

After  making the 2 changes, the cost dropped dramatically... but is
still very high.
Original Explain cost:
cost=1308049564..1345206293 rows=54 width=16

New Explain cost:
cost=11831119..11831120 rows=54 width=16

John




Nis Jørgensen wrote:
> John Major skrev:
>
>> I am trying to join three quite large tables, and the query is
>> unbearably slow(meaning I can't get results in more than a day of
>> processing).
>> I've tried the basic optimizations I understand, and nothing has
>> improved the execute speed.... any help with this would be greatly
>> appreciated
>>
>>
>> The three tables are quite large:
>>  sequence_fragment = 4.5 million rows
>>  sequence_external_info = 10million rows
>>  sequence_alignment = 500 million rows
>>
>>
>> The query I am attempting to run is this:
>>
>> select sf.library_id, fio.clip_type , count(distinct(sa.sequence_id))
>> from   sequence_alignment sa,  sequence_fragment sf,
>> fragment_external_info fio
>> where sf.seq_frag_id = fio.sequence_frag_id
>> and  sf.sequence_id = sa.sequence_id
>> group by sf.library_id, fio.clip_type
>>
>>
>> NOTES:
>> ~there are indexes on all of the fields being joined (but not on
>> library_id or clip_type ). ~Everything has been re-analyzed post index
>> creation
>>
>
> What are the primary (and candidate) keys of the tables? Are any of the
> fields nullable? How many distinct values exist for
> sequence_alignment.sequence_id?
>
>
>> ~I've tried "set enable_seqscan=off" and set (join_table_order or
>> something) = 1
>>
>
>
> It would help if you turned the settings back to defaults before doing
> the ANALYZE - or provide the results of that case as well.
>
>
>> The explain plan is as follows:
>>
>
> [cut]
>
> Without trying to understand the ANALYZE output, I would suggest two
> possible optimizations:
>
> - Using count(distinct(sf.sequence_id)) instead of
> count(distinct(sa.sequence_id)).
>
> - Replacing the join to sequence_alignment with "WHERE sf.sequence_id IN
> (SELECT sequence_id from sequence_alignment)".
>
> The first one probably won't help (nor hurt), but the second one might
> be able to get rid of the table scan, or at least the need do the full
> merge join (which returns an estimated 3 billion rows).
>
> Hope this helps,
>
> Nis
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>

Re: How to improve speed of 3 table join &group (HUGE tables)

From

John Major

Date:

18 October 2007, 17:05:18

Hi Hekki-

When I turn seq_scan off for the new query:

explain
select sf.library_id, fio.clip_type , count(sf.sequence_id)
   from    sequence_fragment sf, fragment_external_info fio
   where sf.seq_frag_id = fio.sequence_frag_id
   and  sf.sequence_id IN
       (SELECT sequence_id from sequence_alignment)
   group by sf.library_id, fio.clip_type

The index is used... but the cost gets worse!
it goes from:
11831119
-TO-
53654888

Actually... The new query executes in ~ 15 minutes... which is good
enough for me for now.

Thanks Nis!

john



Heikki Linnakangas wrote:
> John Major wrote:
>
>> ~there are indexes on all of the fields being joined (but not on
>> library_id or clip_type ). ~Everything has been re-analyzed post index
>> creation
>> ~I've tried "set enable_seqscan=off" and set (join_table_order or
>> something) = 1
>>
>
> Seqscanning and sorting a table is generally faster than a full scan of
> the table using an index scan, unless the heap is roughly in the index
> order. You probably need to CLUSTER the tables to use the indexes
> effectively.
>
> Are you sure you have an index on sequence_alignment.sequence_id? The
> planner seems to choose a seqscan + sort, even though you've set
> enable_seqscan=false.
>
>

Re: How to improve speed of 3 table join &group (HUGE tables)

From

ismo.tuononen@solenovo.fi

Date:

19 October 2007, 02:17:51

Hi,

how about:

select sf.library_id, fio.clip_type , count(sf.sequence_id)
  from    sequence_fragment sf, fragment_external_info fio
        ,(SELECT distinct sequence_id from sequence_alignment) sa
  where sf.seq_frag_id = fio.sequence_frag_id
  and  sf.sequence_id = sa.sequence_id
  group by sf.library_id, fio.clip_type

I don't know postgres well, but I would put my bet in Oracle in that
derived table instead of that in clause.

Ismo

On Thu, 18 Oct 2007, John Major wrote:

> Hi Hekki-
>
> When I turn seq_scan off for the new query:
>
> explain
> select sf.library_id, fio.clip_type , count(sf.sequence_id)
>   from    sequence_fragment sf, fragment_external_info fio
>   where sf.seq_frag_id = fio.sequence_frag_id
>   and  sf.sequence_id IN
>       (SELECT sequence_id from sequence_alignment)
>   group by sf.library_id, fio.clip_type
>
> The index is used... but the cost gets worse!
> it goes from:
> 11831119
> -TO-
> 53654888
>
> Actually... The new query executes in ~ 15 minutes... which is good enough for
> me for now.
>
> Thanks Nis!
>
> john
>
>
>
> Heikki Linnakangas wrote:
> > John Major wrote:
> >
> > > ~there are indexes on all of the fields being joined (but not on
> > > library_id or clip_type ). ~Everything has been re-analyzed post index
> > > creation
> > > ~I've tried "set enable_seqscan=off" and set (join_table_order or
> > > something) = 1
> > >
> >
> > Seqscanning and sorting a table is generally faster than a full scan of
> > the table using an index scan, unless the heap is roughly in the index
> > order. You probably need to CLUSTER the tables to use the indexes
> > effectively.
> >
> > Are you sure you have an index on sequence_alignment.sequence_id? The
> > planner seems to choose a seqscan + sort, even though you've set
> > enable_seqscan=false.
> >
> >
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings
>
>
>

Re: How to improve speed of 3 table join &group (HUGE tables)

From

Nis Jørgensen

Date:

23 October 2007, 08:18:52

John Major skrev:
> Hello Nis-
>
> I did reset the defaults before running the explain.

This line from your original post:

->  Seq Scan on sequence_alignment sa  (cost=100000000.00..110379294.60
rows=467042560 width=4)

Is an indication that you didn't (AFAIK enable_seqscan=off works by
setting the cost of starting a seqscan to 100000000).

> Candidate keys.
>  fragment_external_info.seq_frag_id (FK to sequence_fragment.seq_frag_id)
>  sequence_alignment.sequence_id (FK to sequence_fragment.sequence_id).

Those are not candidate keys. A candidate key is "something which could
have been chosen as the primary key". Anyway, I think I understand your
table layout now. It might have been quicker if you just posted the
definition of your tables. This could also have shown us that the
correct indexes are in place, rather than taking your word for it.

You are absolutely certain that both sides of all FK relationships are
indexed?

> After  making the 2 changes, the cost dropped dramatically... but is still very high.
> Original Explain cost:
> cost=1308049564..1345206293 rows=54 width=16
>
> New Explain cost:
> cost=11831119..11831120 rows=54 width=16

Please post the full output if you want more help. And preferably use
EXPLAIN ANALYZE, now that it runs in finite time.


Nis