Optimizer refuses to hash join - Mailing list pgsql-performance

From Stan Bielski
Subject Optimizer refuses to hash join
Date
Msg-id Pine.LNX.3.96L.1040727152629.26493I-100000@elysium.pdl.cmu.edu
Whole thread Raw
Responses Re: Optimizer refuses to hash join  (Stephan Szabo <sszabo@megazone.bigpanda.com>)
Re: Optimizer refuses to hash join  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance
Hello,

I having a great deal of difficulty getting postgres to do a hash join.
Even if I disable nestloop and mergejoin in postgres.conf, the optimizer
still refuses to select hash join. This behavior is killing my
performance.

Postgres version is 7.3.2 and relevant tables are vacuum analyzed.

Here's an overview of what I'm doing:

I have one table of network logs ordered by time values. The other table
is a set of hosts (approximately 60) that are infected by a worm. I want
to do this query on the dataset:

standb=# explain SELECT count (allflow_tv_sobig.tv_s) FROM
allflow_tv_sobig, blaster_set WHERE allflow_tv_sobig.src =
blaster_set.label AND allflow_tv_sobig.tv_s >= 1060101118::bigint and
allflow_tv_sobig.tv_s < 1060187518::bigint;
                                                       QUERY PLAN

------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=185785.06..185785.06 rows=1 width=32)
   ->  Merge Join  (cost=174939.71..184986.38 rows=319472 width=32)
         Merge Cond: ("outer".label = "inner".src)
         ->  Index Scan using blaster_set_x on blaster_set
(cost=0.00..3.67 rows=66 width=12)
         ->  Sort  (cost=174939.71..178073.92 rows=1253684 width=20)
               Sort Key: allflow_tv_sobig.src
               ->  Index Scan using allflow_tv_sobig_x on allflow_tv_sobig
(cost=0.00..47955.63 rows=1253684 width=20)
                     Index Cond: ((tv_s >= 1060101118::bigint) AND (tv_s <
1060187518::bigint))
(8 rows)

Basically I just want to use the smaller table as a filtering mechanism so
that I only get resulted for hosts in that table. Rather than do the
sensible thing, which is scan the list of infected hosts, then scan the
traffic table and ignore entries that aren't in the first list, the
optimizer insists on SORTING the table of network traffic according to
source address. Considering that this table is very large, these queries
are taking forever.

Doing it in a nested loop, while it doesn't require sorting, still takes a
very long time as well.

Is there anyway that I can force the optimizer to do this the right way,
aside from adding each IP manually to a disgustingly bloated 'where'
clause?


Thanks,
-S




pgsql-performance by date:

Previous
From: Tom Lane
Date:
Subject: Re: best way to fetch next/prev record based on index
Next
From: "Stephane Tessier"
Date:
Subject: my boss want to migrate to ORACLE