Re: looking for a faster way to do that - Mailing list pgsql-general

From Alban Hertroys
Subject Re: looking for a faster way to do that
Msg-id CAF-3MvNz1CSwnPh14grK80pQV=bNhTFya_zt0BnQ5s0um38zag@mail.gmail.com
In response to Re: looking for a faster way to do that  (hamann.w@t-online.de)
Responses Re: looking for a faster way to do that
List pgsql-general
On 23 September 2011 09:45, <hamann.w@t-online.de> wrote:
> Alban Hertroys <haramrae@gmail.com> wrote:

>> What is the output of explain?
>>
>> You say 'the other table', so presumably we're dealing with a foreign key
>> here. Is there an index on that column?

> Albe Laurenz wrote:

>> Is the index used for "where code ~ '^ABC3563'"?
>>
>> If not, then the result is fast only because the table is scanned only once,
>> and it's just the factor of 3000 that's killing you.
>>
>> The second query (where code ~ wantcode) can never use an index because
>> the pattern "wantcode" is unknown at query planning time.
>>
>> Yours,
>> Laurenz Albe


> Here I created a subset (just number and code matching a certain prefix):
>
> \d items
>          Table "pg_temp_1.items"
>  Column |         Type          | Modifiers
> --------+-----------------------+-----------
>  num    | integer               |
>  code   | character varying(40) |
> create index itemsc on items (code);
>
> select count(*) from items;
>  count
> -------
>  9614

> A single anchored query
> select * from items where code ~ '^ABC';
> does indeed use the index to retrieve data.

> Next I copied a file of wanted codes:
>
> create temp table n (wantcode text);
> \copy n from /tmp/rmartin.tmp
>
> The file contains plain names, i.e. unanchored matches.

> explain analyze select num, n.wantcode from items, n where items.code ~ n.wantcode;
>  Nested Loop  (cost=20.00..216502.14 rows=48070 width=36) (actual time=148.479..336280.488 rows=2871 loops=1)
>   Join Filter: (("outer".code)::text ~ "inner".wantcode)
>   ->  Seq Scan on items  (cost=0.00..167.14 rows=9614 width=42) (actual time=0.048..38.666 rows=9614 loops=1)
>   ->  Materialize  (cost=20.00..30.00 rows=1000 width=32) (actual time=0.001..1.049 rows=815 loops=9614)
>         ->  Seq Scan on n  (cost=0.00..20.00 rows=1000 width=32) (actual time=0.003..1.839 rows=815 loops=1)
>  Total runtime: 336286.692 ms

So you're comparing a variable field value to a variable pattern - yeah, that's going to hurt. There's no way you could index exactly that.

Perhaps there's some way you can transform the problem so that you get something indexable?
For example, if your match patterns follow a certain pattern themselves, you could add a column storing the longest pattern that matches each string. Then you can simply query for records whose new column equals the pattern you're looking for, which is an exact, indexable comparison, and voila!

Whether something like that is possible depends strongly on what kind of match patterns you're using, of course.
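A rough sketch of that transformation (my own illustration, not something from this thread): if each pattern is a ^-anchored regex, you can extract its guaranteed literal prefix and store that in the extra column. The `literal_prefix` helper below is hypothetical, just to show the idea:

```python
def literal_prefix(pattern):
    """Longest guaranteed literal prefix of a ^-anchored regex pattern.

    Hypothetical helper: returns '' for an unanchored pattern, since an
    unanchored match has no indexable prefix at all.
    """
    if not pattern.startswith('^'):
        return ''
    metas = '.^$*+?()[]{}|\\'
    prefix = []
    for ch in pattern[1:]:
        if ch in metas:
            # '*' or '?' makes the preceding literal optional, so it
            # cannot be part of a guaranteed prefix
            if ch in '*?' and prefix:
                prefix.pop()
            break
        prefix.append(ch)
    return ''.join(prefix)

# The original regex would still be applied as a recheck, e.g.
#   WHERE code LIKE prefix || '%' AND code ~ pattern
print(literal_prefix('^ABC3563'))  # -> ABC3563
print(literal_prefix('^ABC*'))     # -> AB
print(literal_prefix('ABC'))       # -> '' (unanchored)
```

The prefix comparison is exact and indexable; the regex recheck then only runs on the few rows the index returns instead of on every row pair.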
 
> An exact match "where items.code = n.wantcode" on the same data completes in 40 ms.

That's an exact string match, of course that will be fast ;)

--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.
