Thread: Joining 2 tables with 300 million rows

Joining 2 tables with 300 million rows

From: Amit V Shah
Date: Thu, 8 Dec 2005 11:59:24 -0500
Hi all,

First of all, please pardon me if the question is dumb! Is it even
feasible or normal to do such a thing? This query is needed by a
webpage, so it needs to be lightning fast; anything beyond 2-3 seconds
is unacceptable performance.

I have two tables

CREATE TABLE runresult
(
  id_runresult int8 NOT NULL,
  rundefinition_id_rundefinition int4 NOT NULL,
  measure_id_measure int4 NOT NULL,
  value float4 NOT NULL,
  "sequence" varchar(20) NOT NULL,
  CONSTRAINT pk_runresult_ars PRIMARY KEY (id_runresult)
)


CREATE TABLE runresult_has_catalogtable
(
  runresult_id_runresult int8 NOT NULL,
  catalogtable_id_catalogtable int4 NOT NULL,
  value int4 NOT NULL,
  CONSTRAINT pk_runresult_has_catalogtable PRIMARY KEY
(runresult_id_runresult, catalogtable_id_catalogtable, value),
  CONSTRAINT fk_temp FOREIGN KEY (runresult_id_runresult) REFERENCES
runresult(id_runresult) ON UPDATE RESTRICT ON DELETE RESTRICT
)

Each table has around 300 million records (and will probably grow into
the billions). Below are the query and the EXPLAIN ANALYZE output --

explain analyze SELECT measure.description, runresult.value
FROM ((((rundefinition INNER JOIN runresult ON
rundefinition.id_rundefinition = runresult.rundefinition_id_rundefinition)
INNER JOIN runresult_has_catalogtable ON runresult.id_runresult =
runresult_has_catalogtable.runresult_id_runresult)
INNER JOIN runresult_has_catalogtable AS runresult_has_catalogtable_1 ON
runresult.id_runresult =
runresult_has_catalogtable_1.runresult_id_runresult)
INNER JOIN runresult_has_catalogtable AS runresult_has_catalogtable_2 ON
runresult.id_runresult =
runresult_has_catalogtable_2.runresult_id_runresult)
INNER JOIN measure ON runresult.measure_id_measure = measure.id_measure
WHERE (((runresult_has_catalogtable.catalogtable_id_catalogtable)=52)
AND ((runresult_has_catalogtable_1.catalogtable_id_catalogtable)=54)
AND ((runresult_has_catalogtable_2.catalogtable_id_catalogtable)=55)
AND ((runresult_has_catalogtable.value)=15806)
AND ((runresult_has_catalogtable_1.value)=1)
AND ((runresult_has_catalogtable_2.value) In (21,22,23,24))
AND ((rundefinition.id_rundefinition)=10106));

'Nested Loop  (cost=0.00..622582.70 rows=1 width=28) (actual
time=25.221..150.563 rows=22 loops=1)'
'  ->  Nested Loop  (cost=0.00..622422.24 rows=2 width=52) (actual
time=25.201..150.177 rows=22 loops=1)'
'        ->  Nested Loop  (cost=0.00..622415.97 rows=2 width=32) (actual
time=25.106..149.768 rows=22 loops=1)'
'              ->  Nested Loop  (cost=0.00..621258.54 rows=15 width=24)
(actual time=24.582..149.061 rows=30 loops=1)'
'                    ->  Index Scan using pk_rundefinition on rundefinition
(cost=0.00..3.86 rows=1 width=4) (actual time=0.125..0.147 rows=1 loops=1)'
'                          Index Cond: (id_rundefinition = 10106)'
'                    ->  Nested Loop  (cost=0.00..621254.54 rows=15
width=28) (actual time=24.443..148.784 rows=30 loops=1)'
'                          ->  Index Scan using
runresult_has_catalogtable_value on runresult_has_catalogtable
(cost=0.00..575069.35 rows=14437 width=8) (actual time=0.791..33.036
rows=10402 loops=1)'
'                                Index Cond: (value = 15806)'
'                                Filter: (catalogtable_id_catalogtable =
52)'
'                          ->  Index Scan using pk_runresult_ars on
runresult  (cost=0.00..3.19 rows=1 width=20) (actual time=0.007..0.007
rows=0 loops=10402)'
'                                Index Cond: (runresult.id_runresult =
"outer".runresult_id_runresult)'
'                                Filter: (10106 =
rundefinition_id_rundefinition)'
'              ->  Index Scan using runresult_has_catalogtable_id_runresult
on runresult_has_catalogtable runresult_has_catalogtable_1
(cost=0.00..76.65 rows=41 width=8) (actual time=0.015..0.017 rows=1
loops=30)'
'                    Index Cond:
(runresult_has_catalogtable_1.runresult_id_runresult =
"outer".runresult_id_runresult)'
'                    Filter: ((catalogtable_id_catalogtable = 54) AND (value
= 1))'
'        ->  Index Scan using pk_measure on measure  (cost=0.00..3.12 rows=1
width=28) (actual time=0.008..0.010 rows=1 loops=22)'
'              Index Cond: ("outer".measure_id_measure =
measure.id_measure)'
'  ->  Index Scan using runresult_has_catalogtable_id_runresult on
runresult_has_catalogtable runresult_has_catalogtable_2  (cost=0.00..79.42
rows=65 width=8) (actual time=0.007..0.010 rows=1 loops=22)'
'        Index Cond: (runresult_has_catalogtable_2.runresult_id_runresult =
"outer".runresult_id_runresult)'
'        Filter: ((catalogtable_id_catalogtable = 55) AND ((value = 21) OR
(value = 22) OR (value = 23) OR (value = 24)))'
'Total runtime: 150.863 ms'



Re: Joining 2 tables with 300 million rows

From: Dmitri Bichko
Date: Thu, 8 Dec 2005 1:47 PM
What's the problem?  You are joining two 300-million-row tables in 0.15
of a second - seems reasonable.

Dmitri


Re: Joining 2 tables with 300 million rows

From: Amit V Shah
Date: Thu, 8 Dec 2005 5:01 PM
Hi,

The thing is, although it shows 0.15 seconds, when I run the actual
query it takes around 40-45 seconds (sorry, I forgot to mention that).
It also depends on the data: some parameter values match very few
records, while others match a lot more. I don't know how to read the
EXPLAIN results very well, but it looked like there were no sequential
scans and it only used indexes.

Also, another problem is that the second time I run this query, I
believe it comes back from cache, so it returns in about 2 seconds!
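
A rough way to observe the cache effect from psql is a sketch like the
following (a simplified query over the same table; \timing is a psql
meta-command, and 15806 is just the value from the query above):

\timing

-- First run: most pages have to come from disk
SELECT count(*) FROM runresult_has_catalogtable WHERE value = 15806;

-- Second run: the same pages are now in shared buffers / OS cache
SELECT count(*) FROM runresult_has_catalogtable WHERE value = 15806;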

That's why I was wondering whether joining two tables like that is even
advisable at all ...

Thanks,
Amit



Re: Joining 2 tables with 300 million rows

From: Jeff Trout
Date:
On Dec 8, 2005, at 5:01 PM, Amit V Shah wrote:

> Hi,
>
> The thing is, although it shows 0.15 seconds, when I run the actual
> query it takes around 40-45 seconds (sorry, I forgot to mention that).
> It also depends on the data: some parameter values match very few
> records, while others match a lot more. I don't know how to read the
> EXPLAIN results very well, but it looked like there were no sequential
> scans and it only used indexes.
>

The planner will look at the values you used, and it may decide to
switch plans if it realizes you're querying a very frequent value.

Another thing that may be a factor is the network: when doing EXPLAIN
ANALYZE, the server doesn't have to transfer the result set to the
client.
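
For example, a quick psql sketch can separate the two effects (the
filter values are the ones posted above; the "frequent" value in the
last query is purely hypothetical):

\timing

-- Same work on the server as the real query's first step, but no rows
-- are shipped to the client:
EXPLAIN ANALYZE
SELECT runresult_id_runresult
FROM runresult_has_catalogtable
WHERE catalogtable_id_catalogtable = 52 AND value = 15806;

-- The same SELECT with rows actually transferred; a large gap between
-- the two timings points at the network or the client:
SELECT runresult_id_runresult
FROM runresult_has_catalogtable
WHERE catalogtable_id_catalogtable = 52 AND value = 15806;

-- Re-running EXPLAIN ANALYZE with a much more common value may show
-- the planner switching away from the index scan:
EXPLAIN ANALYZE
SELECT runresult_id_runresult
FROM runresult_has_catalogtable
WHERE catalogtable_id_catalogtable = 52 AND value = 1;  -- hypothetical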

--
Jeff Trout <jeff@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/



Re: Joining 2 tables with 300 million rows

From: Manfred Koizar
Date:
On Thu, 8 Dec 2005 11:59:24 -0500, Amit V Shah <ashah@tagaudit.com>
wrote:
>  CONSTRAINT pk_runresult_has_catalogtable PRIMARY KEY
>(runresult_id_runresult, catalogtable_id_catalogtable, value)

>'              ->  Index Scan using runresult_has_catalogtable_id_runresult
>on runresult_has_catalogtable runresult_has_catalogtable_1
>(cost=0.00..76.65 rows=41 width=8) (actual time=0.015..0.017 rows=1
>loops=30)'
>'                    Index Cond:
>(runresult_has_catalogtable_1.runresult_id_runresult =
>"outer".runresult_id_runresult)'
>'                    Filter: ((catalogtable_id_catalogtable = 54) AND (value
>= 1))'

If I were the planner, I'd use the primary key index.  You seem to
have a redundant(?) index on
runresult_has_catalogtable(runresult_id_runresult).  Dropping it might
help, or it might make things much worse.  But at this stage this is
pure speculation.
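
To see what is actually defined, something like this works (pg_indexes
is a standard system view; the DROP is deliberately left commented out):

SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'runresult_has_catalogtable';

-- Only if runresult_has_catalogtable_id_runresult turns out to merely
-- duplicate the leading column of the primary key, and only after
-- testing on a copy of the data:
-- DROP INDEX runresult_has_catalogtable_id_runresult;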

Give us more information first.  Show us the complete definition
(including *all* indices) of all tables occurring in your query.  What
Postgres version is this?  And please post EXPLAIN ANALYSE output of a
*slow* query.
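
From psql, for example, the definitions and version can be gathered
like this:

\d runresult
\d runresult_has_catalogtable
\d rundefinition
\d measure
SELECT version();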
Servus
 Manfred