Thread: Join of small table with large table

Join of small table with large table

From
large scale
Date: Fri, 10 May 2002
Hi,

I have two tables: one has 25000 rows and the other
has 6.5 million rows.

table1 (25000 rows):
(id     text,
 start  int,
 stop   int)

with separate indexes on the three individual fields.

table2 (6.5 million rows):
(id     text,
 start  int,
 stop   int)

with separate indexes on the three individual fields.
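
In SQL, that setup is roughly the following (the index names here are
assumed for illustration, not the real ones):

CREATE TABLE table1 (id text, start int, stop int);   -- ~25000 rows
CREATE INDEX table1_id_idx    ON table1 (id);
CREATE INDEX table1_start_idx ON table1 (start);
CREATE INDEX table1_stop_idx  ON table1 (stop);

CREATE TABLE table2 (id text, start int, stop int);   -- ~6.5 million rows
CREATE INDEX table2_id_idx    ON table2 (id);
CREATE INDEX table2_start_idx ON table2 (start);
CREATE INDEX table2_stop_idx  ON table2 (stop);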

When I query these two tables to find overlapping
records, I use this query:

**************************************************************
select count(distinct(table1.id))
from table1, table2
where table1.id = table2.id
  and ((table2.start >= table1.start and table2.start <= table1.stop)
    or (table2.start <= table1.start and table1.start <= table2.stop));
**************************************************************

When I do an EXPLAIN, I get this back:

**************************************************************
Aggregate  (cost=353859488.21..353859488.21 rows=1 width=78)
  ->  Merge Join  (cost=1714676.02..351297983.38 rows=1024601931 width=78)
        ->  Index Scan using genescript_genomseqid on genescript  (cost=0.00..750.35 rows=25115 width=62)
        ->  Sort  (cost=1714676.02..1714676.02 rows=6801733 width=16)
              ->  Seq Scan on mouseblathuman  (cost=0.00..153685.33 rows=6801733 width=16)

EXPLAIN
**************************************************************

My questions are:

1) Why does the query start with a seq scan on the much bigger table?
   I would think it should scan the first table and use the index on
   the bigger one.
2) The query itself takes forever; is there a way to speed it up?
3) Does this have anything to do with the query planner?

This is kind of an urgent project, so your prompt help
is greatly appreciated.  Thanks.

Jim



Re: Join of small table with large table

From
"Rod Taylor"
Date:
The big problem with the query probably isn't the scans, given your
dataset and the way indexes work.
I'm actually rather surprised it chose an index on the smaller table.

It's the aggregate that's taking the time.  Which means a faster CPU or a
simpler aggregate will do the trick.  I.e., do you really need that
DISTINCT part?

--
Rod
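
A sketch of the simpler-aggregate idea: dropping DISTINCT avoids the
de-duplication work, but note that it changes the answer -- it counts
every matching pair rather than distinct table1 ids, so it only applies
if the de-duplication isn't actually needed.

-- counts matching pairs, not distinct table1 ids
select count(*)
from table1, table2
where table1.id = table2.id
  and ((table2.start >= table1.start and table2.start <= table1.stop)
    or (table2.start <= table1.start and table1.start <= table2.stop));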



Re: Join of small table with large table

From
"Rod Taylor"
Date:
Of course, something else you may want to do is allow postgresql to
use a whack load more sort space in RAM -- assuming you have RAM
free.

It's probably hitting the disk a lot for temporary storage space.

http://www.ca.postgresql.org/docs/momjian/hw_performance/
http://www.argudo.org/postgresql/soft-tuning.html


--
Rod
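
In PostgreSQL 7.x the sort space Rod is talking about is the sort_mem
parameter, measured in kilobytes.  A minimal sketch, assuming the
machine really does have the RAM to spare:

-- per-session override; it can also be raised globally in postgresql.conf
SET sort_mem = 32768;   -- roughly 32 MB for each sort this backend runs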



Re: Join of small table with large table

From
Tom Lane
Date:
large scale <largescale_1999@yahoo.com> writes:
> Aggregate  (cost=353859488.21..353859488.21 rows=1
> width=78)
>   ->  Merge Join  (cost=1714676.02..351297983.38
> rows=1024601931 width=78)
>         ->  Index Scan using genescript_genomseqid on
> genescript  (cost=0.00..750.35 rows=25115 width=62)
>         ->  Sort  (cost=1714676.02..1714676.02
> rows=6801733 width=16)
>               ->  Seq Scan on mouseblathuman 
> (cost=0.00..153685.33 rows=6801733 width=16)

That plan seems odd to me too.  Have you done VACUUM ANALYZE on these
tables?

I would think that a hash join would be preferable.  You might need to
increase the SORT_MEM parameter to let the whole smaller table be
stuffed into memory before the planner will think so, though.
Try setting it to 10000 or so (i.e., 10 MB).
        regards, tom lane
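
Tom's suggestions, put together as a rough sketch (the table names are
taken from the EXPLAIN output above):

VACUUM ANALYZE genescript;       -- the ~25000-row table
VACUUM ANALYZE mouseblathuman;   -- the ~6.5-million-row table
SET sort_mem = 10000;            -- ~10 MB, per Tom's figure
-- then re-run EXPLAIN on the query and see whether the planner now
-- picks a Hash Join instead of the Merge Join + big Sort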


Re: Join of small table with large table

From
Stephan Szabo
Date:
On Fri, 10 May 2002, large scale wrote:

> I have two tables, one has 25000 rows and the other
> has 6.5 million rows.
> [...]

We'll start with the standard questions:  Have you
vacuum analyzed?  What version are you running? (If
it's less than 7.2, you may want to see about
upgrading.)  If you do a set enable_seqscan=false;
what does the explain show then?  I'd also be interested
to know whether 1024601931 is even remotely a valid number
of rows from that join (which is about
.5% of an entire Cartesian join if my math is right).

Perhaps some EXISTS-style thing would be faster, since
that would at least presumably be able to stop when
it found a matching table2 row for a particular table1
id.
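
The two experiments Stephan suggests, sketched out (the EXISTS form
should return the same count as the original query, since it still
counts the distinct table1 ids that have at least one match):

-- planner test: what plan does it pick when seq scans are penalized?
SET enable_seqscan = false;
EXPLAIN
select count(distinct(table1.id))
from table1, table2
where table1.id = table2.id
  and ((table2.start >= table1.start and table2.start <= table1.stop)
    or (table2.start <= table1.start and table1.start <= table2.stop));
SET enable_seqscan = true;

-- EXISTS-style rewrite: can stop at the first matching table2 row
-- for each table1 row instead of producing every matching pair
select count(distinct(table1.id))
from table1
where exists (
    select 1
    from table2
    where table2.id = table1.id
      and ((table2.start >= table1.start and table2.start <= table1.stop)
        or (table2.start <= table1.start and table1.start <= table2.stop))
);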