how can this get faster - Mailing list pgsql-general

From hamann.w@t-online.de
Subject how can this get faster
Date
Msg-id wolfgang-1130422192633.A0128168@amadeus3.local
Whole thread Raw
List pgsql-general
Hi,

I have a table with mainly a list of keywords and a few odd columns referring to their origin.
There is a standard btree index, and searches are carried out as anchored regex.
Entries in the table and queries are uppercased; this turned out to have a big speed
advantage over a ~* condition.
So queries are like:
select * from mytable where word ~ '^CARNEVAL'
The table currently holds 15.000.000 rows, and query times are on the order of 10-20 msec

Now I am trying something different: I have a temp table with words appearing in the table
plus some extra data, and I want to get the other columns from that table too

select aux.col1, aux.col2, mytable.inf1, mytable.inf2 from aux, mytable where aux.word = mytable.word;

I specifically do
analyze aux;
before the operation. On a sample my aux table has 433 entries (different words), and I get
this behaviour
 Hash Join  (cost=12.74..321957.88 rows=472279 width=12) (actual time=0.530..2242.837 rows=29342 loops=1)
   Hash Cond: (mytable.word = aux.word)
   ->  Seq Scan on mytable  (cost=0.00..259779.89 rows=15317989 width=12) (actual time=0.009..808.577 rows=15317717
loops=1)
   ->  Hash  (cost=7.33..7.33 rows=433 width=14) (actual time=0.173..0.173 rows=433 loops=1)
         ->  Seq Scan on aux  (cost=0.00..7.33 rows=433 width=14) (actual time=0.004..0.063 rows=433 loops=1)
 Total runtime: 2243.924 ms

experimenting with different query data (i.e. different collections of words) reveals a
strong dependency on actual data. In particular, I split the above in two subtables.
The first one containing 45 rows produced
 Nested Loop  (cost=31.41..169444.22 rows=49082 width=12) (actual time=0.040..37.739 rows=590 loops=1)
   ->  Seq Scan on aux1  (cost=0.00..1.45 rows=45 width=18) (actual time=0.003..0.011 rows=45 loops=1)
   ->  Bitmap Heap Scan on mytable  (cost=31.41..3751.76 rows=1091 width=12) (actual time=0.014..0.054 rows=13
loops=45)
         Recheck Cond: (mytable.word = aux1.word)
         ->  Bitmap Index Scan on mytable_word_idx  (cost=0.00..31.14 rows=1091 width=0) (actual time=0.008..0.008
rows=13loops=45) 
               Index Cond: (mytable.word = aux1.word)
 Total runtime: 37.800 ms
while the second one (remaining 388 rows) resulted in another sequential scan on the large table,
(and even no result within 3 minutes, when omitting the analyze step

Is there a way to "help" the planner choose the index scan

Regards
Wolfgang Hamann




pgsql-general by date:

Previous
From: Rafał Pietrak
Date:
Subject: Re: using text search
Next
From: Steven Schlansker
Date:
Subject: Using an index for IS DISTINCT FROM queries