Thread: TimeOf(Subselects|Joins)FromLargeTables?

TimeOf(Subselects|Joins)FromLargeTables?

From

"Hegedus, Tamas ."

Date:

05 June 2004, 00:17:59

Dear All,

(I asked the general-list, I had some advice, but the final conclusion was to ask the developers and tell them a
possiblemistake of the query planner, too.)
 

I am a biologist and I do not know what to expect from an RDB (PgSQL).
I have large tables: 1215607 rows in prots, 2184596 rows in kwx (see table details below). I would like to do something
likethat:
 

SELECT name, seq FROM prots WHERE fid in (SELECT fid FROM kwx WHERE kw_acc=812);

After executing this (either as a subquery or joins) one of the best/fastest result what I had (SET
enable_seqscan=off):
83643.482 ms (see EXPLAIN ANALYZE below).

The two (similar) parts of this query are executed much faster:
SELECT fid FROM kwx WHERE kw_acc=812 -- takes 302ms, n(rows)=78050
SELECT name, seq FROM prots WHERE fid < 80000 -- takes 1969.231 ms

Is this realistic? OK?
If not: how can I increase the speed by fine tuning of the RDB (indexes, run-time parameters) or my SQL query?
(It came now into my mind: if I decrease the number of columns in the prots table (to have only 3 fields (fid, name,
seq)instead of 20 columns), than the prots table will have smaller file size on disk, than this table may need less
diskpage fetches, queries may be faster. Is this true?)
 

---------------------------------------------
Dann[GENERAL] suggested to use hash-index on prot.fid and kwx.fid.
However, you suggested not to use hash index ("Note: Testing has shown PostgreSQL's hash indexes to perform no better
thanB-tree indexes, and the index size and build time for hash indexes is much worse. For these reasons, hash index use
ispresently discouraged."), the query speed increased: 36134.105 ms
 

According to Dann the planner should have used the hash-indexes instead of b-trees. But I had to remove the b-trees to
forcethe planner to use the hash-indexes. Is this normal, or is there a softer way to force hash-index using of the
planner(coexisting hash and b-tree type indexes for the same column).
 

---------------------------------------------
What should I expect from the following (similar with joins) queries?
How should I optimize the indexes? Which one to use?

SELECT name, seq FROM prots WHERE fid in (SELECT fid FROM kwx WHERE kw_acc=812 AND kw_acc=215);
SELECT name, seq FROM prots WHERE fid in (SELECT fid FROM kwx WHERE kw_acc=812 OR kw_acc=215);

---------------------------------------------
When I executed the same explain analyze select etc., the speed decreased to 5s. This means the statistics of the
planneris pretty good. In order to test a real situation (when I do not have statistic), how can I clear this
statistics?

Thanks for your help!
Tamas

===============================================
Table "public.prots" Column   |         Type          | Modifiers
-----------+----------------------+----------fid       | integer               | not nullname      | character
varying(10)| not null
 
[...other 17 columns...]seq       | text                  |
Indexes:   "prots_pkey" primary key, btree (fid)   "ix_prots_acc" unique, btree (acc)   "ix_prots_name" unique, btree
(name)  "ix_prots_class" btree ("class")
 
===============================================
Table "public.kwx"Column |  Type   | Modifiers
--------+--------+----------fid    | integer |kw_acc | integer |
Indexes:   "ix_kwx_acc" btree (kw_acc)   "ix_kwx_fid" btree (fid)
Foreign-key constraints:   "fk_kws_acc" FOREIGN KEY (kw_acc) REFERENCES kw_ref(kw_acc)   "fk_kws_fid" FOREIGN KEY (fid)
REFERENCESprots(fid)
 
===============================================

EXPLAIN ANALYZE SELECT name, seq from prots inner join kwx on (prots.fid=kwx.fid) where kwx.kw_acc = 812;
                                                 QUERY PLAN
                                                     
 

----------------------------------------------------------------------------------------------------------------------------------------------Merge
Join (cost=0.00..160429.66 rows=84473 width=349) (actual time=29.039..83505.629 rows=78050 loops=1)  Merge Cond:
("outer".fid= "inner".fid)  ->  Index Scan using ix_kwx_fid on kwx  (cost=0.00..44987.55 rows=84473 width=4) (actual
time=18.893..5730.468rows=78050 loops=1)        Filter: (kw_acc = 812)  ->  Index Scan using prots_pkey on prots
(cost=0.00..112005.24rows=981127 width=353) (actual time=0.083..76059.235 rows=1210377 loops=1)Total runtime: 83643.482
ms
(6 rows)

===================================================================
--with hash-indexes
===================================================================
explain ANALYZE select p.name, p.seq from prots p, kwx k where p.fid=k.fid AND k.kw_acc=812;
                                QUERY PLAN        
 

---------------------------------------------------------------------------------------------------------------------------------------Nested
Loop (cost=0.00..662253.63 rows=69180 width=344) (actual time=43.337..36076.045 rows=78050 loops=1)  ->  Index Scan
usingix_kwx_acc on kwx k  (cost=0.00..245382.38 rows=69179 width=4) (actual time=0.109..422.159 rows=78050 loops=1)
  Index Cond: (kw_acc = 812)  ->  Index Scan using prt_fid_ix on prots p  (cost=0.00..6.01 rows=1 width=348) (actual
time=0.414..0.450rows=1 loops=78050)        Index Cond: (p.fid = "outer".fid)Total runtime: 36134.105 ms
 

====================================================================
EXPLAIN ANALYZE SELECT name, seq from prots inner join kwx on (prots.fid=kwx.fid) where kwx.kw_acc = 812;
                                         QUERY PLAN            
 

------------------------------------------------------------------------------------------------------------------------------Nested
Loop (cost=0.00..560203.74 rows=86657 width=343) (actual time=0.139..13924.698 rows=78050 loops=1)  ->  Seq Scan on kwx
(cost=0.00..38016.45 rows=86656 width=4) (actual time=0.065..3363.545 rows=78050 loops=1)        Filter: (kw_acc = 812)
->  Index Scan using prt_fid_ix on prots  (cost=0.00..6.01 rows=1 width=347) (actual time=0.111..0.129 rows=1
loops=78050)       Index Cond: (prots.fid = "outer".fid)Total runtime: 13981.767 ms
 


=====================================================================
EXPLAIN ANALYZE SELECT name, seq from prots inner join kwx on (prots.fid=kwx.fid) where kwx.kw_acc = 813;
                                         QUERY PLAN                          
 

------------------------------------------------------------------------------------------------------------------------------Nested
Loop (cost=0.00..288142.81 rows=41509 width=343) (actual time=0.974..8645.423 rows=42388 loops=1)  ->  Seq Scan on kwx
(cost=0.00..38016.45rows=41508 width=4) (actual time=0.926..1161.183 rows=42388 loops=1)        Filter: (kw_acc = 813)
-> Index Scan using prt_fid_ix on prots  (cost=0.00..6.01 rows=1 width=347) (actual time=0.151..0.170 rows=1
loops=42388)       Index Cond: (prots.fid = "outer".fid)Total runtime: 8676.116 ms
 
(6 rows)

Re: TimeOf(Subselects|Joins)FromLargeTables?

From

"Dann Corbit"

Date:

05 June 2004, 00:42:39

> -----Original Message-----
> From: Hegedus, Tamas . [mailto:Hegedus.Tamas@mayo.edu]
> Sent: Friday, June 04, 2004 5:18 PM
> To: 'pgsql-hackers@postgresql.org'
> Subject: [HACKERS] TimeOf(Subselects|Joins)FromLargeTables?
>
>
> Dear All,
>
[snip]
> ---------------------------------------------
> What should I expect from the following (similar with joins)
> queries? How should I optimize the indexes? Which one to use?
>
> SELECT name, seq FROM prots WHERE fid in (SELECT fid FROM kwx
> WHERE kw_acc=812 AND kw_acc=215);

This query is a false tautology and hence will return zero rows.  The
attribute kwx cannot be simultaneously 812 and 215.

> SELECT name, seq FROM prots
> WHERE fid in (SELECT fid FROM kwx WHERE kw_acc=812 OR kw_acc=215);

I suspect that this query would be better formulated as:
SELECT prots.name, prots.seq
FROM prots, kwx
WHERE      prots.fid = kwx.fid AND kwx.kw_acc IN (812, 215);

I think joins give the planner better options than subselects and also
an in-list is going to do better than a list of constants separated by
OR.

Though you should test them both and see what the planner says.

[snip]

Re: TimeOf(Subselects|Joins)FromLargeTables?

From

Tom Lane

Date:

05 June 2004, 04:56:05

"Dann Corbit" <DCorbit@connx.com> writes:
> I think ... an in-list is going to do better than a list of constants
> separated by OR.

Right at the moment, there is no performance difference betweenWHERE foo IN (const1, const2, const3)
andWHERE foo = const1 OR foo = const2 OR foo = const3
because in fact the parser expands the former into the latter at
a pretty early stage.

Still, I'd suggest using the IN form when you can.  It's more
concise, arguably less prone to typos, and someday we might even
implement it more efficiently than the other form.
        regards, tom lane