problem with huge joins - Mailing list pgsql-general

From Kolus Maximiliano
Subject problem with huge joins
Date
Msg-id A56C8EE648A74A4F9653B071E9F7142002085FBC@bcrweb.bcr.intranet.com.ar
Responses Re: problem with huge joins  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general

Hello,

        I don't know if this is *entirely* a PostgreSQL question or if it's my screwup with the database design.

        First, a little introduction to the problem: I maintain a database of open proxies (they're used for tagging and/or rejecting email), and I use PostgreSQL as its backend. The list itself is contained in three tables:

tHost (458K rows)
        hostid    integer    (PK + HASH INDEX)
        ip        inet
        rdns      varchar

tList (403K rows)
        listid    integer    (PK)
        input     integer
        output    integer    (HASH INDEX)
        date      timestamp

tPort (986K rows)
        portid    integer    (PK)
        listid    integer    (HASH INDEX)
        protocol  integer
        port      integer    (BTREE INDEX)
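
        For reference, the three tables correspond roughly to the DDL below. This is a reconstruction from the listing above; the exact column definitions and the index names are my guesses:

CREATE TABLE tHost (
    hostid   integer PRIMARY KEY,    -- plus a separate hash index
    ip       inet,
    rdns     varchar
);
CREATE INDEX thost_hostid_hash ON tHost USING hash (hostid);

CREATE TABLE tList (
    listid   integer PRIMARY KEY,
    input    integer,
    output   integer,                -- hash index, see below
    date     timestamp
);
CREATE INDEX tlist_output_hash ON tList USING hash (output);

CREATE TABLE tPort (
    portid   integer PRIMARY KEY,
    listid   integer,                -- hash index, see below
    protocol integer,
    port     integer                 -- btree index, see below
);
CREATE INDEX tport_listid_hash ON tPort USING hash (listid);
CREATE INDEX tport_port_btree ON tPort (port);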

        Every hour I must build a couple of files with the list itself. To do this, I need to retrieve each output IP address (list + host) and, for each IP address, its port/protocol combinations, i.e.:

192.168.1.1             1080            3 (for socks 4)

        In order to get this, I run this query:

SELECT ip, TO_CHAR(date, 'YYYY-MM-DD'), protocol, port
FROM tPort, tList, tHost
WHERE tPort.listid = tList.listid
        AND tList.output = tHost.hostid
ORDER BY ip, port

        Its query plan (output of EXPLAIN ANALYZE) is:

Sort  (cost=311874.07..311874.07 rows=986130 width=44) (actual time=300086.42..302580.25 rows=986130 loops=1)
  ->  Hash Join  (cost=39735.96..96907.83 rows=986130 width=44) (actual time=86226.28..223195.50 rows=986130 loops=1)
        ->  Seq Scan on tport  (cost=0.00..18629.30 rows=986130 width=12) (actual time=0.15..25910.56 rows=986130 loops=1)
        ->  Hash  (cost=35972.38..35972.38 rows=403034 width=32) (actual time=86194.99..86194.99 rows=0 loops=1)
              ->  Hash Join  (cost=9787.92..35972.38 rows=403034 width=32) (actual time=12180.64..84316.65 rows=403927 loops=1)
                    ->  Seq Scan on thost  (cost=0.00..7850.41 rows=457341 width=16) (actual time=619.09..10032.85 rows=458787 loops=1)
                    ->  Hash  (cost=6812.34..6812.34 rows=403034 width=16) (actual time=6656.36..6656.36 rows=0 loops=1)
                          ->  Seq Scan on tlist  (cost=0.00..6812.34 rows=403034 width=16) (actual time=6.90..5030.22 rows=403927 loops=1)
Total runtime: 317046.69 msec

        As you can see, it takes eons to complete, and I couldn't find a way to make it faster (the hash indexes are one of those attempts, and yes, I did run VACUUM). I don't think a table with nearly one million rows is "that much" for postgres, so I'm convinced I'm doing something wrong. Any ideas on how to speed this thing up will be appreciated.
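
        For example, one variant I have not ruled out yet is dropping the hash indexes on the join columns in favor of plain btrees and refreshing the planner statistics afterwards. A sketch (the index names here are invented):

-- btree replacements for the hash indexes on the join columns
CREATE INDEX tlist_output_idx ON tList (output);
CREATE INDEX tport_listid_idx ON tPort (listid);

-- refresh planner statistics after rebuilding the indexes
VACUUM ANALYZE tHost;
VACUUM ANALYZE tList;
VACUUM ANALYZE tPort;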

Version:
PostgreSQL 7.2.1 on i686-pc-linux-gnu, compiled by GCC 2.95.4
(shared buffers: 128)

CPU:
Pentium III (Katmai), 600 MHz

/proc/meminfo:

        total:    used:    free:  shared: buffers:  cached:
Mem:  263475200 218435584 45039616        0 17240064 161374208
Swap: 511959040  7389184 504569856
MemTotal:       257300 kB
MemFree:         43984 kB
MemShared:           0 kB
Buffers:         16836 kB
Cached:         155348 kB
SwapCached:       2244 kB
Active:         131204 kB
Inactive:        68760 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       257300 kB
LowFree:         43984 kB
SwapTotal:      499960 kB
SwapFree:       492744 kB
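
        One more experiment I can think of: with shared buffers at only 128 (about 1 MB) and sort_mem left at its default, the final sort over roughly a million rows presumably spills to disk. Raising sort_mem for the reporting session might show whether memory is the bottleneck (the value below is just a guess):

-- give sorts and hash tables more memory, this session only
SET sort_mem = 16384;   -- in kB, i.e. 16 MB

EXPLAIN ANALYZE
SELECT ip, TO_CHAR(date, 'YYYY-MM-DD'), protocol, port
FROM tPort, tList, tHost
WHERE tPort.listid = tList.listid
        AND tList.output = tHost.hostid
ORDER BY ip, port;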

        Thanks in advance.
       
