optimizing 2-table join w/millions of rows - Mailing list pgsql-sql

From Michael Olivier
Subject optimizing 2-table join w/millions of rows
Date
Msg-id 19981120022608.22274.rocketmail@send102.yahoomail.com
Whole thread Raw
Responses Re: [SQL] optimizing 2-table join w/millions of rows  (Herouth Maoz <herouth@oumail.openu.ac.il>)
List pgsql-sql
Hi,

I'm trying to optimize a query between two large tables, which
ultimately will have millions of entries each. I'd appreciate any
suggestions folks have on optimizing. I've tried several things but
haven't gotten the performance I need yet.

The tables are a user table and a background table. Each user can have
multiple backgrounds, that's why they're separate tables. Relationship
is:

     1 user <--------> many backgrounds

 +------------------+----------------------------------+----------+
 |  Owner           |             Relation             |   Type   |
 +------------------+----------------------------------+----------+
 | michael          | bgndtest                         | table    |
 | michael          | bgndacctname_key                 | index    |
 | michael          | bgndage_key                      | index    |
 | michael          | bgndloc_key                      | index    |
 | michael          | usertest                         | table    |
 | michael          | userage_key                      | index    |
 | michael          | userloc_key                      | index    |
 | michael          | usertestkey                      | index    |
 +------------------+----------------------------------+----------+


Table    = bgndtest
+------------------------+---------------------------------+-------+
|              Field     |              Type               | Length|
+------------------------+---------------------------------+-------+
| acctname               | text                            |   var |*
| part_needed            | int4                            |     4 |*
| loc_needed             | int4                            |     4 |*
+------------- ----------+---------------------------------+-------+


Table    = usertest
+-------------------------+---------------------------------+-------+
|              Field      |              Type               | Length|
+-------------------------+---------------------------------+-------+
| acctname                | text                            |   var |*
| email                   | text                            |   var |
| part                    | int4                            |     4 |*
| loc                     | int4                            |     4 |*
+-------------------------+---------------------------------+-------+

* indicates a single column index is available


select U.acctname from usertest U, bgndtest B where
    B.part_needed=3 and B.loc_needed=5 and
    B.acctname=U.acctname and U.acctname in
        (select acctname from usertest where part=2 and loc=3)

(query plan at bottom of this message)

There are other similar queries I'll need to do which will have even
more conditions in the where clause than this...

I noticed it's about 4x slower when I change just one of the =
comparisons to a < ... but I thought with btree indexes, < comparisons
would run fast. Aren't btrees the default for indexes? I ultimately
will be doing a lot of range comparisons, not equality.

If I change "U.acctname in (select...)" to "B.acctname in
(select...)", is there any difference in performance? Seems to be but
I don't know why.

I'm currently benchmarking performance that degrades 4x for every 2x
growth in the tables. Presumably the performance is linear to (users *
backgrounds), and in my tests the number of users rows and backgrounds
rows is the same. But this won't be acceptable once I get to millions
of users,  when 20k users costs 3.4 secs on this query (RH 5.1,
Postgres 6.3.2-10, 350 MHz Pentium II).

Re. optimization tips from the FAQ:
* vacuum analyze made a world of difference (10x) after creating indexes
  and adding tons of test data
* I've tried running postmaster with -o -F to get rid of fsync(), didn't
  seem to help much
* I've tried using -B 1280 because I'm happy to eat up memory for
  performance, didn't notice a big difference - where can I read
  more about using this parameter?
* I'm not doing a lot of sorts, so increasing memory for sorts
  won't help, I don't think
* Cluster doesn't seem like it will help because I'm finding rows
  based on several parameters
* returning OID from subquery and indexing OID column didn't seem to
  help performance

Is there any way to get postgres to load a whole table into memory?
I'd like to do some benchmarking with small number of table entries to
find out how it's going to scale. In production, I can buy a machine
with lots of memory and take the altavista approach, but this requires
a DB that will hold tables in memory.

If you think I'd be better off with another DBMS (could be Linux or
NT, must have eval period of 60 days or more, or be cheap or free) --
please let me know.

Thanks very much for any input, pointers to docs, etc! I'm new to
query optimization and I'm working hard to get an understanding of how
PostgreSQL will scale for my project.

--Michael

----------------

query plan:

Nested Loop  (cost=11.09 size=1 width=24)
  ->  Index Scan on b  (cost=9.04 size=1 width=12)
        SubPlan
          ->  Index Scan on usertest  (cost=2.05 size=1 width=12)
  ->  Index Scan on u  (cost=2.05 size=10200 width=12)

verbose query plan:

{NESTLOOP :cost 11.0861  :size 1  :width 24  :state <> :qptargetlist
({TLE :resdom {RESDOM :resno 1  :restype 25  :restypmod -1  :resname
acctname :reskey 0  :reskeyop 0  :resjunk 0 } :expr {VAR :varno 65001
:varattno 1  :vartype 25  :vartypmod -1  :varlevelsup 0  :varnoold 1
:varoattno 1 }}) :qpqual <> :lefttree {INDEXSCAN :cost 9.03614  :size
1  :width 12  :state <> :qptargetlist ({TLE :resdom {RESDOM :resno 1
:restype 25  :restypmod -1  :resname <> :reskey 0  :reskeyop 0
:resjunk 0 } :expr {VAR :varno 1  :varattno 1  :vartype 25  :vartypmod
-1  :varlevelsup 0  :varnoold 1  :varoattno 1 }}) :qpqual ({EXPR
:typeOid 0  :opType op :oper {OPER :opno 96  :opid 65  :opresulttype
16 } :args ({VAR :varno 1  :varattno 2  :vartype 23  :vartypmod -1
:varlevelsup 0  :varnoold 1  :varoattno 2 } {CONST :consttype 23
:constlen 4  :constisnull false :constvalue  4 [  26  0  0  0 ]
:constbyval true})} {EXPR :typeOid 16  :opType subp :oper {SUBPLAN
:plan {INDEXSCAN :cost 2.05  :size 1  :width 12  :state <>
:qptargetlist ({TLE :resdom {RESDOM :resno 1  :restype 25  :restypmod
-1  :resname acctname :reskey 0  :reskeyop 0  :resjunk 0 } :expr {VAR
:varno 1  :varattno 1  :vartype 25  :vartypmod -1  :varlevelsup 0
:varnoold 1  :varoattno 1 }}) :qpqual ({EXPR :typeOid 0  :opType op
:oper {OPER :opno 96  :opid 65  :opresulttype 16 } :args ({VAR :varno
1  :varattno 3  :vartype 23  :vartypmod -1  :varlevelsup 0  :varnoold
1  :varoattno 3 } {CONST :consttype 23  :constlen 4  :constisnull
false :constvalue  4 [  12  0  0  0 ]  :constbyval true})}) :lefttree
<> :righttree <> :extprm () :locprm () :initplan <> :nprm 0
:scanrelid 1  :indxid ( 189085 ) :indxqual (({EXPR :typeOid 0  :opType
op :oper {OPER :opno 96  :opid 65  :opresulttype 16 } :args ({VAR
:varno 1  :varattno 1  :vartype 23  :vartypmod -1  :varlevelsup 0
:varnoold 1  :varoattno 4 } {CONST :consttype 23  :constlen 4
:constisnull false :constvalue  4 [  27  0  0  0 ]  :constbyval
true})}))} :planid 0  :rtable ({RTE :relname usertest :refname
usertest :relid 189056  :inh false :inFromCl true :skipAcl false})
:setprm () :parprm () :slink {SUBLINK :subLinkType 2  :useor false
:lefthand ({VAR :varno 1  :varattno 1  :vartype 25  :vartypmod -1
:varlevelsup 0  :varnoold 1  :varoattno 1 }) :oper ({EXPR :typeOid 16
:opType op :oper {OPER :opno 98  :opid 67  :opresulttype 16 } :args
({VAR :varno 1  :varattno 1  :vartype 25  :vartypmod -1  :varlevelsup
0  :varnoold 1  :varoattno 1 } {CONST :consttype 25  :constlen 0
:constisnull true :constvalue <> :constbyval true})}) :subselect <>}}
:args <>}) :lefttree <> :righttree <> :extprm () :locprm () :initplan
<> :nprm 0  :scanrelid 1  :indxid ( 189094 ) :indxqual (({EXPR
:typeOid 0  :opType op :oper {OPER :opno 96  :opid 65  :opresulttype
16 } :args ({VAR :varno 1  :varattno 1  :vartype 23  :vartypmod -1
:varlevelsup 0  :varnoold 1  :varoattno 3 } {CONST :consttype 23
:constlen 4  :constisnull false :constvalue  4 [  5  0  0  0 ]
:constbyval true})}))} :righttree {INDEXSCAN :cost 2.05  :size 10200
:width 12  :state <> :qptargetlist ({TLE :resdom {RESDOM :resno 1
:restype 25  :restypmod -1  :resname <> :reskey 0  :reskeyop 0
:resjunk 0 } :expr {VAR :varno 2  :varattno 1  :vartype 25  :vartypmod
-1  :varlevelsup 0  :varnoold 2  :varoattno 1 }}) :qpqual <> :lefttree
<> :righttree <> :extprm () :locprm () :initplan <> :nprm 0
:scanrelid 2  :indxid ( 189079 ) :indxqual (({EXPR :typeOid 0  :opType
op :oper {OPER :opno 98  :opid 67  :opresulttype 16 } :args ({VAR
:varno 65001  :varattno 1  :vartype 25  :vartypmod -1  :varlevelsup 0
:varnoold 1  :varoattno 1 } {VAR :varno 2  :varattno 1  :vartype 25
:vartypmod -1  :varlevelsup 0  :varnoold 2  :varoattno 1 })}))}
:extprm () :locprm () :initplan <> :nprm 0 }

_________________________________________________________
DO YOU YAHOO!?
Get your free @yahoo.com address at http://mail.yahoo.com


pgsql-sql by date:

Previous
From: "G. Anthony Reina"
Date:
Subject: Postgres installed but my C programs won't compile
Next
From: sqyang
Date:
Subject: optimization