Thread: TimeOf(Subselects|Joins)FromLargeTables?

TimeOf(Subselects|Joins)FromLargeTables?

From
"Hegedus, Tamas ."
Date:
Dear All,

I am a biologist and I do not know what to expect from an RDB (PgSQL).
I have large tables: 1215607 rows in prots, 2184596 rows in kwx (see table details below). I would like to do something
likethat: 

SELECT name, seq FROM prots WHERE fid in (SELECT fid FROM kwx WHERE kw_acc=812);

After executing this (either as a subquery or joins) the best/fastest result what I had (SET enable_seqscan=off):
83643.482 ms (see EXPLAIN ANALYZE below).

The two (similar) parts of this query are executed much faster:
SELECT fid FROM kwx WHERE kw_acc=812 -- takes 302ms, n(rows)=78050
SELECT name, seq FROM prots WHERE fid < 80000 -- takes 1969.231 ms

Is this realistic? OK?
If not: how can I increase the speed by fine tuning of the RDB (indexes, run-time parameters) or my SQL query?
(It came now into my mind: if I decrease the number of columns in the prots table (to have only 3 fields (fid, name,
seq)instead of 20 columns), than the prots table will have smaller file size on disk, than this table may need less
diskpage fetches, queries may be faster. Is this true?) 

Thanks for your help!
Tamas

===============================================
Table "public.prots"
  Column   |         Type          | Modifiers
-----------+----------------------+----------
 fid       | integer               | not null
 name      | character varying(10) | not null
[...other 17 columns...]
 seq       | text                  |
Indexes:
    "prots_pkey" primary key, btree (fid)
    "ix_prots_acc" unique, btree (acc)
    "ix_prots_name" unique, btree (name)
    "ix_prots_class" btree ("class")
===============================================
Table "public.kwx"
 Column |  Type   | Modifiers
--------+--------+----------
 fid    | integer |
 kw_acc | integer |
Indexes:
    "ix_kwx_acc" btree (kw_acc)
    "ix_kwx_fid" btree (fid)
Foreign-key constraints:
    "fk_kws_acc" FOREIGN KEY (kw_acc) REFERENCES kw_ref(kw_acc)
    "fk_kws_fid" FOREIGN KEY (fid) REFERENCES prots(fid)
===============================================

EXPLAIN ANALYZE SELECT name, seq from prots inner join kwx on (prots.fid=kwx.fid) where kwx.kw_acc = 812;
                                                                  QUERY PLAN
                                                                       

----------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=0.00..160429.66 rows=84473 width=349) (actual time=29.039..83505.629 rows=78050 loops=1)
   Merge Cond: ("outer".fid = "inner".fid)
   ->  Index Scan using ix_kwx_fid on kwx  (cost=0.00..44987.55 rows=84473 width=4) (actual time=18.893..5730.468
rows=78050loops=1) 
         Filter: (kw_acc = 812)
   ->  Index Scan using prots_pkey on prots  (cost=0.00..112005.24 rows=981127 width=353) (actual time=0.083..76059.235
rows=1210377loops=1) 
 Total runtime: 83643.482 ms
(6 rows)



Re: TimeOf(Subselects|Joins)FromLargeTables?

From
"Dann Corbit"
Date:
How does this query perform:

SELECT p.name, p.seq
FROM prots p, kwx k
WHERE
    p.fid=k.fid
    AND
    k.kw_acc=812
;

> -----Original Message-----
> From: Hegedus, Tamas . [mailto:Hegedus.Tamas@mayo.edu]
> Sent: Thursday, June 03, 2004 6:48 PM
> To: 'pgsql-general@postgresql.org'
> Subject: [GENERAL] TimeOf(Subselects|Joins)FromLargeTables?
>
>
> Dear All,
>
> I am a biologist and I do not know what to expect from an RDB
> (PgSQL). I have large tables: 1215607 rows in prots, 2184596
> rows in kwx (see table details below). I would like to do
> something like that:
>
> SELECT name, seq FROM prots WHERE fid in (SELECT fid FROM kwx
> WHERE kw_acc=812);
>
> After executing this (either as a subquery or joins) the
> best/fastest result what I had (SET enable_seqscan=off):
> 83643.482 ms (see EXPLAIN ANALYZE below).
>
> The two (similar) parts of this query are executed much
> faster: SELECT fid FROM kwx WHERE kw_acc=812 -- takes 302ms,
> n(rows)=78050 SELECT name, seq FROM prots WHERE fid < 80000
> -- takes 1969.231 ms
>
> Is this realistic? OK?
> If not: how can I increase the speed by fine tuning of the
> RDB (indexes, run-time parameters) or my SQL query? (It came
> now into my mind: if I decrease the number of columns in the
> prots table (to have only 3 fields (fid, name, seq) instead
> of 20 columns), than the prots table will have smaller file
> size on disk, than this table may need less disk page
> fetches, queries may be faster. Is this true?)
>
> Thanks for your help!
> Tamas
>
> ===============================================
> Table "public.prots"
>   Column   |         Type          | Modifiers
> -----------+----------------------+----------
>  fid       | integer               | not null
>  name      | character varying(10) | not null
> [...other 17 columns...]
>  seq       | text                  |
> Indexes:
>     "prots_pkey" primary key, btree (fid)
>     "ix_prots_acc" unique, btree (acc)
>     "ix_prots_name" unique, btree (name)
>     "ix_prots_class" btree ("class")
> ===============================================
> Table "public.kwx"
>  Column |  Type   | Modifiers
> --------+--------+----------
>  fid    | integer |
>  kw_acc | integer |
> Indexes:
>     "ix_kwx_acc" btree (kw_acc)
>     "ix_kwx_fid" btree (fid)
> Foreign-key constraints:
>     "fk_kws_acc" FOREIGN KEY (kw_acc) REFERENCES kw_ref(kw_acc)
>     "fk_kws_fid" FOREIGN KEY (fid) REFERENCES prots(fid)
> ===============================================
>
> EXPLAIN ANALYZE SELECT name, seq from prots inner join kwx on
> (prots.fid=kwx.fid) where kwx.kw_acc = 812;
>
>     QUERY PLAN
>
> --------------------------------------------------------------
> --------------------------------------------------------------
> ------------------
>  Merge Join  (cost=0.00..160429.66 rows=84473 width=349)
> (actual time=29.039..83505.629 rows=78050 loops=1)
>    Merge Cond: ("outer".fid = "inner".fid)
>    ->  Index Scan using ix_kwx_fid on kwx
> (cost=0.00..44987.55 rows=84473 width=4) (actual
> time=18.893..5730.468 rows=78050 loops=1)
>          Filter: (kw_acc = 812)
>    ->  Index Scan using prots_pkey on prots
> (cost=0.00..112005.24 rows=981127 width=353) (actual
> time=0.083..76059.235 rows=1210377 loops=1)  Total runtime:
> 83643.482 ms (6 rows)
>
>
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to
> majordomo@postgresql.org
>

Re: TimeOf(Subselects|Joins)FromLargeTables?

From
Duane Lee - EGOVX
Date:

Or this correlated subquery:

SELECT a.name, a.seq FROM prots as a
WHERE EXISTS (
SELECT '1'FROM kwx as b
WHERE b.kw_acc=812
  and b.fid = a.fid
);

-----Original Message-----
From: Dann Corbit [mailto:DCorbit@connx.com]
Sent: Thursday, June 03, 2004 6:59 PM
To: Hegedus, Tamas .; pgsql-general@postgresql.org
Subject: Re: [GENERAL] TimeOf(Subselects|Joins)FromLargeTables?

How does this query perform:

SELECT p.name, p.seq
FROM prots p, kwx k
WHERE
        p.fid=k.fid
        AND
        k.kw_acc=812
;

> -----Original Message-----
> From: Hegedus, Tamas . [mailto:Hegedus.Tamas@mayo.edu]
> Sent: Thursday, June 03, 2004 6:48 PM
> To: 'pgsql-general@postgresql.org'
> Subject: [GENERAL] TimeOf(Subselects|Joins)FromLargeTables?
>
>
> Dear All,
>
> I am a biologist and I do not know what to expect from an RDB
> (PgSQL). I have large tables: 1215607 rows in prots, 2184596
> rows in kwx (see table details below). I would like to do
> something like that:
>
> SELECT name, seq FROM prots WHERE fid in (SELECT fid FROM kwx
> WHERE kw_acc=812);
>
> After executing this (either as a subquery or joins) the
> best/fastest result what I had (SET enable_seqscan=off):
> 83643.482 ms (see EXPLAIN ANALYZE below).
>
> The two (similar) parts of this query are executed much
> faster: SELECT fid FROM kwx WHERE kw_acc=812 -- takes 302ms,
> n(rows)=78050 SELECT name, seq FROM prots WHERE fid < 80000
> -- takes 1969.231 ms
>
> Is this realistic? OK?
> If not: how can I increase the speed by fine tuning of the
> RDB (indexes, run-time parameters) or my SQL query? (It came
> now into my mind: if I decrease the number of columns in the
> prots table (to have only 3 fields (fid, name, seq) instead
> of 20 columns), than the prots table will have smaller file
> size on disk, than this table may need less disk page
> fetches, queries may be faster. Is this true?)
>
> Thanks for your help!
> Tamas
>
> ===============================================
> Table "public.prots"
>   Column   |         Type          | Modifiers
> -----------+----------------------+----------
>  fid       | integer               | not null
>  name      | character varying(10) | not null
> [...other 17 columns...]
>  seq       | text                  |
> Indexes:
>     "prots_pkey" primary key, btree (fid)
>     "ix_prots_acc" unique, btree (acc)
>     "ix_prots_name" unique, btree (name)
>     "ix_prots_class" btree ("class")
> ===============================================
> Table "public.kwx"
>  Column |  Type   | Modifiers
> --------+--------+----------
>  fid    | integer |
>  kw_acc | integer |
> Indexes:
>     "ix_kwx_acc" btree (kw_acc)
>     "ix_kwx_fid" btree (fid)
> Foreign-key constraints:
>     "fk_kws_acc" FOREIGN KEY (kw_acc) REFERENCES kw_ref(kw_acc)
>     "fk_kws_fid" FOREIGN KEY (fid) REFERENCES prots(fid)
> ===============================================
>
> EXPLAIN ANALYZE SELECT name, seq from prots inner join kwx on
> (prots.fid=kwx.fid) where kwx.kw_acc = 812;
>                                                              
>     QUERY PLAN                                               
>                                                                   
> --------------------------------------------------------------
> --------------------------------------------------------------
> ------------------
>  Merge Join  (cost=0.00..160429.66 rows=84473 width=349)
> (actual time=29.039..83505.629 rows=78050 loops=1)
>    Merge Cond: ("outer".fid = "inner".fid)
>    ->  Index Scan using ix_kwx_fid on kwx 
> (cost=0.00..44987.55 rows=84473 width=4) (actual
> time=18.893..5730.468 rows=78050 loops=1)
>          Filter: (kw_acc = 812)
>    ->  Index Scan using prots_pkey on prots 
> (cost=0.00..112005.24 rows=981127 width=353) (actual
> time=0.083..76059.235 rows=1210377 loops=1)  Total runtime:
> 83643.482 ms (6 rows)
>
>
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to
> majordomo@postgresql.org
>

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

Re: TimeOf(Subselects|Joins)FromLargeTables?

From
"Dann Corbit"
Date:
> -----Original Message-----
> From: Hegedus, Tamas . [mailto:Hegedus.Tamas@mayo.edu]
> Sent: Friday, June 04, 2004 12:37 PM
> To: Dann Corbit
> Subject: RE: [GENERAL] TimeOf(Subselects|Joins)FromLargeTables?
>
>
> Dear Dann,
>
> However, the pgsql developers do not recommend to use hash
> indexes, the speed doubled using hash indexes.
> ("Note: Testing has shown PostgreSQL's hash indexes to
> perform no better than B-tree indexes, and the index size and
> build time for hash indexes is much worse. For these reasons,
> hash index use is presently discouraged.")
>
> Question:
> If I created hash indexes, the query planner did not use
> them; it used the b-tree indexes. I had to drop the b-tree
> indexes to use the hash indexes. Why? Can I have 2 types of
> indexes on the same column and force the planner to use what I want?

Hash indexes and btree indexes work very differently.  Normally, hash
indexes are best for equal joins and btree indexes are best for any join
that is not based on exact matching.  So (for instance):

    WHERE colname = some_constant
Or
    WHERE colname IN (constant1, constant2, constant3,...,constantk)

Would be better handled by a hash index

And
    WHERE colname <= some_constant
Or
    WHERE colname LIKE 'const%'

Would be better handled by a btree

In most database systems I have used, you could make any sorts of
indexes on any sorts of columns in any sorts of combinations up to some
limits (e.g. 16 columns max).

The planner should be smart enough to pick the best thing.  I think that
the planner is simply making a mistake here.  Perhaps the PostgreSQL
gurus can help somehow forcing the plan.

There seems to be something seriously defective with hash indexes in old
versions of PostgreSQL.  I thought that it had been repaired in the
latest versions of PostgreSQL starting with 7.4 and above but maybe it
is not completely fixed yet.

> Thanks,
> Tamas
>
> ===================================================================
> explain ANALYZE select p.name, p.seq from prots p, kwx k
> where p.fid=k.fid AND k.kw_acc=812;
>
> QUERY PLAN
> --------------------------------------------------------------
> --------------------------------------------------------------
> -----------
>  Nested Loop  (cost=0.00..662253.63 rows=69180 width=344)
> (actual time=43.337..36076.045 rows=78050 loops=1)
>    ->  Index Scan using ix_kwx_acc on kwx k
> (cost=0.00..245382.38 rows=69179 width=4) (actual
> time=0.109..422.159 rows=78050 loops=1)
>          Index Cond: (kw_acc = 812)
>    ->  Index Scan using prt_fid_ix on prots p
> (cost=0.00..6.01 rows=1 width=348) (actual time=0.414..0.450
> rows=1 loops=78050)
>          Index Cond: (p.fid = "outer".fid)
>  Total runtime: 36134.105 ms
> ===================================================================
>
> -----Original Message-----
> From: Dann Corbit [mailto:DCorbit@connx.com]
> Sent: Thursday, June 03, 2004 7:36 PM
> To: Hegedus, Tamas .
> Subject: RE: [GENERAL] TimeOf(Subselects|Joins)FromLargeTables?
>
>
> With a subselect, it limits the optimizers choice of which
> table to process first.
>
> It might be worth trying a hashed index on fid for both
> tables, since you are doing an equal join.
>
> > -----Original Message-----
> > From: Hegedus, Tamas . [mailto:Hegedus.Tamas@mayo.edu]
> > Sent: Thursday, June 03, 2004 7:33 PM
> > To: Dann Corbit
> > Subject: RE: [GENERAL] TimeOf(Subselects|Joins)FromLargeTables?
> >
> >
> > A little bit better. But I do not understand why?
> >
> > EXPLAIN ANALYZE select p.name, p.seq from prots p, kwx k
> > where p.fid=k.fid AND k.kw_acc=812;
> >
> >      QUERY PLAN
> > --------------------------------------------------------------
> > --------------------------------------------------------------
> > --------------------
> >  Merge Join  (cost=0.00..160429.66 rows=84473 width=349)
> > (actual time=0.263..69192.828 rows=78050 loops=1)
> >    Merge Cond: ("outer".fid = "inner".fid)
> >    ->  Index Scan using ix_kwx_fid on kwx k
> > (cost=0.00..44987.55 rows=84473 width=4) (actual
> > time=0.137..5675.701 rows=78050 loops=1)
> >          Filter: (kw_acc = 812)
> >    ->  Index Scan using prots_pkey on prots p
> > (cost=0.00..112005.24 rows=981127 width=353) (actual
> > time=0.059..61816.725 rows=1210377 loops=1)  Total runtime:
> > 69251.488 ms
> >
> >
> > -----Original Message-----
> > From: Dann Corbit [mailto:DCorbit@connx.com]
> > Sent: Thursday, June 03, 2004 6:59 PM
> > To: Hegedus, Tamas .; pgsql-general@postgresql.org
> > Subject: RE: [GENERAL] TimeOf(Subselects|Joins)FromLargeTables?
> >
> >
> > How does this query perform:
> >
> > SELECT p.name, p.seq
> > FROM prots p, kwx k
> > WHERE
> >     p.fid=k.fid
> >     AND
> >     k.kw_acc=812
> > ;
> >
> > > -----Original Message-----
> > > From: Hegedus, Tamas . [mailto:Hegedus.Tamas@mayo.edu]
> > > Sent: Thursday, June 03, 2004 6:48 PM
> > > To: 'pgsql-general@postgresql.org'
> > > Subject: [GENERAL] TimeOf(Subselects|Joins)FromLargeTables?
> > >
> > >
> > > Dear All,
> > >
> > > I am a biologist and I do not know what to expect from an RDB
> > > (PgSQL). I have large tables: 1215607 rows in prots,
> 2184596 rows in
> > > kwx (see table details below). I would like to do something like
> > > that:
> > >
> > > SELECT name, seq FROM prots WHERE fid in (SELECT fid FROM
> kwx WHERE
> > > kw_acc=812);
> > >
> > > After executing this (either as a subquery or joins) the
> > > best/fastest result what I had (SET enable_seqscan=off):
> 83643.482
> > > ms (see EXPLAIN ANALYZE below).
> > >
> > > The two (similar) parts of this query are executed much
> > > faster: SELECT fid FROM kwx WHERE kw_acc=812 -- takes 302ms,
> > > n(rows)=78050 SELECT name, seq FROM prots WHERE fid < 80000
> > > -- takes 1969.231 ms
> > >
> > > Is this realistic? OK?
> > > If not: how can I increase the speed by fine tuning of the RDB
> > > (indexes, run-time parameters) or my SQL query? (It came
> now into my
> > > mind: if I decrease the number of columns in the prots table (to
> > > have only 3 fields (fid, name, seq) instead of 20
> columns), than the
> > > prots table will have smaller file size on disk, than
> this table may
> > > need less disk page fetches, queries may be faster. Is this true?)
> > >
> > > Thanks for your help!
> > > Tamas
> > >
> > > ===============================================
> > > Table "public.prots"
> > >   Column   |         Type          | Modifiers
> > > -----------+----------------------+----------
> > >  fid       | integer               | not null
> > >  name      | character varying(10) | not null
> > > [...other 17 columns...]
> > >  seq       | text                  |
> > > Indexes:
> > >     "prots_pkey" primary key, btree (fid)
> > >     "ix_prots_acc" unique, btree (acc)
> > >     "ix_prots_name" unique, btree (name)
> > >     "ix_prots_class" btree ("class")
> > > ===============================================
> > > Table "public.kwx"
> > >  Column |  Type   | Modifiers
> > > --------+--------+----------
> > >  fid    | integer |
> > >  kw_acc | integer |
> > > Indexes:
> > >     "ix_kwx_acc" btree (kw_acc)
> > >     "ix_kwx_fid" btree (fid)
> > > Foreign-key constraints:
> > >     "fk_kws_acc" FOREIGN KEY (kw_acc) REFERENCES kw_ref(kw_acc)
> > >     "fk_kws_fid" FOREIGN KEY (fid) REFERENCES prots(fid)
> > > ===============================================
> > >
> > > EXPLAIN ANALYZE SELECT name, seq from prots inner join kwx on
> > > (prots.fid=kwx.fid) where kwx.kw_acc = 812;
> > >
> > >     QUERY PLAN
> > >
>
> > > --------------------------------------------------------------
> > > --------------------------------------------------------------
> > > ------------------
> > >  Merge Join  (cost=0.00..160429.66 rows=84473 width=349) (actual
> > > time=29.039..83505.629 rows=78050 loops=1)
> > >    Merge Cond: ("outer".fid = "inner".fid)
> > >    ->  Index Scan using ix_kwx_fid on kwx
> > > (cost=0.00..44987.55 rows=84473 width=4) (actual
> > > time=18.893..5730.468 rows=78050 loops=1)
> > >          Filter: (kw_acc = 812)
> > >    ->  Index Scan using prots_pkey on prots
> > > (cost=0.00..112005.24 rows=981127 width=353) (actual
> > > time=0.083..76059.235 rows=1210377 loops=1)  Total runtime:
> > > 83643.482 ms (6 rows)
> > >
> > >
> > >
> > > ---------------------------(end of
> > > broadcast)---------------------------
> > > TIP 1: subscribe and unsubscribe commands go to
> > > majordomo@postgresql.org
> > >
> >
>