Thread: looking for a faster way to do that
Hi,

I have one large table (about a million entries) with an indexed column containing codes
like ABC3561A, ABC3563X, 72-451-823 etc. (lots of order numbers from different
manufacturers).

When I ask for a specific item
select code .... where code = 'ABC3563X'
I get a fast result. I also get a fast result when doing a prefix match
select code .... where code ~ '^ABC3563'

If I am retrieving many items by joining with another table
select code ..... where code = wantcode
this is still fast.
If I try to get many items on a prefix match
select code .... where code ~ wantcode
things go very slow. Explain shows a nested loop, so seemingly the table is rescanned
for every wanted item in the other table. A test run (3000 wanted codes against a
shortened table of 10000 ones) took about 200 seconds to complete.

What other queries could I use to get the requested selection?

Regards
Wolfgang
hamann.w <mailto:hamann.w@t-online.de> wrote (Sent: Wed 2011-09-21 17:59):

> I have one large table (about a million entries) with an indexed column containing codes
> like ABC3561A, ABC3563X, 72-451-823 etc. (lots of order numbers from different
> manufacturers)
>
> When I ask for a specific item
> select code .... where code = 'ABC3563X'
> I get fast result. I also get fast result when doing a prefix match
> select code .... where code ~ '^ABC3563'
>
> If I am retrieving many items by joining with another table
> select code ..... where code = wantcode
> this is still fast.
> If I try to get many items on a prefix match
> select code .... where code ~ wantcode
> things go very slow. Explain shows a nested loop, so seemingly the table is rescanned
> for every wanted item in the other table. A test run (3000 wanted codes against a
> shortened table of 10000 ones) took about 200 seconds to complete
>
> What other queries could I use to get the requested selection?

Is the index used for "where code ~ '^ABC3563'"?

If not, then the result is fast only because the table is scanned only once,
and it's just the factor of 3000 that's killing you.

The second query (where code ~ wantcode) can never use an index because
the pattern "wantcode" is unknown at query planning time.

Yours,
Laurenz Albe
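[Editor's note, not part of the original thread] The reason the constant pattern '^ABC3563' can use the index is that the planner rewrites an anchored regex into a range scan over the B-tree sort order. The sketch below shows that rewrite done by hand, using Python's stdlib sqlite3 module as a stand-in database; the table name, index name, and sample codes mirror the thread but are invented here:

```python
import sqlite3

# In-memory stand-in for the indexed items table from the thread.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (code TEXT)")
con.execute("CREATE INDEX itemsc ON items (code)")
con.executemany("INSERT INTO items VALUES (?)",
                [("ABC3561A",), ("ABC3563X",), ("ABC3563B",), ("72-451-823",)])

prefix = "ABC3563"
# 'code ~ ^ABC3563' is equivalent to this closed-open range on the sort
# order, which a B-tree index can serve; U+FFFF acts as a high sentinel.
rows = con.execute(
    "SELECT code FROM items WHERE code >= ? AND code < ? ORDER BY code",
    (prefix, prefix + "\uffff")).fetchall()
codes = [r[0] for r in rows]
print(codes)  # -> ['ABC3563B', 'ABC3563X']
```

In PostgreSQL the planner only performs this rewrite when the pattern is a constant at plan time (and, for LIKE, when the column uses the C collation or a text_pattern_ops index), which is exactly why the join variant below defeats it.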
> If I try to get many items on a prefix match
> select code .... where code ~ wantcode
> things go very slow. Explain shows a nested loop, so seemingly the table is rescanned
> for every wanted item in the other table. A test run (3000 wanted codes against a
> shortened table of 10000 ones) took about 200 seconds to complete
What is the output of explain?
You say 'the other table', so presumably we're dealing with a foreign key here. Is there an index on that column?
--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.
Alban Hertroys <haramrae@gmail.com> wrote:

>> What is the output of explain?
>>
>> You say 'the other table', so presumably we're dealing with a foreign key
>> here. Is there an index on that column?

Albe Laurenz wrote:

>> Is the index used for "where code ~ '^ABC3563'"?
>>
>> If not, then the result is fast only because the table is scanned only once,
>> and it's just the factor of 3000 that's killing you.
>>
>> The second query (where code ~ wantcode) can never use an index because
>> the pattern "wantcode" is unknown at query planning time.
>>
>> Yours,
>> Laurenz Albe

Here I created a subset (just number and code matching a certain prefix):

\d items
        Table "pg_temp_1.items"
 Column |         Type          | Modifiers
--------+-----------------------+-----------
 num    | integer               |
 code   | character varying(40) |

create index itemsc on items (code);

select count(*) from items;
 count
-------
  9614

A single anchored query
select * from items where code ~ '^ABC';
does indeed use the index to retrieve data.

Next I copied a file of wanted codes:
create temp table n (wantcode text);
\copy n from /tmp/rmartin.tmp
The file contains plain names, i.e. unanchored matches.

explain analyze select num, n.wantcode from items, n where items.code ~ n.wantcode;

 Nested Loop  (cost=20.00..216502.14 rows=48070 width=36) (actual time=148.479..336280.488 rows=2871 loops=1)
   Join Filter: (("outer".code)::text ~ "inner".wantcode)
   ->  Seq Scan on items  (cost=0.00..167.14 rows=9614 width=42) (actual time=0.048..38.666 rows=9614 loops=1)
   ->  Materialize  (cost=20.00..30.00 rows=1000 width=32) (actual time=0.001..1.049 rows=815 loops=9614)
         ->  Seq Scan on n  (cost=0.00..20.00 rows=1000 width=32) (actual time=0.003..1.839 rows=815 loops=1)
 Total runtime: 336286.692 ms

An exact match "where items.code = n.wantcode" on the same data completes in 40 ms.

BTW: indexing the second table does not affect the query plan or the runtime, it just
shows the actual row count rather than an estimate.
This is, of course, bad; an anchored match could be faster and is also more appropriate
to the scenario. So I change the contents of the second table
update n set wantcode = textcat('^', wantcode);
and try again, with similar results:

 Nested Loop  (cost=14.15..176478.01 rows=39178 width=36) (actual time=125.114..308831.697 rows=2871 loops=1)
   Join Filter: (("outer".code)::text ~ "inner".wantcode)
   ->  Seq Scan on items  (cost=0.00..167.14 rows=9614 width=42) (actual time=0.061..2034.572 rows=9614 loops=1)
   ->  Materialize  (cost=14.15..22.30 rows=815 width=32) (actual time=0.001..1.095 rows=815 loops=9614)
         ->  Seq Scan on n  (cost=0.00..14.15 rows=815 width=32) (actual time=0.114..1.893 rows=815 loops=1)
 Total runtime: 308837.746 ms

I am aware that this is unlikely to work fast: the planner would perhaps need a hint in
the query, rather than in the data column, to choose an anchored match algorithm (in
case there is such an algorithm, of course).

So I wonder whether there might be a different approach to this problem rather than
pattern matching. I recall I had a similar problem before with a "contacts" column
possibly containing one or more email addresses. There, searches would also have had
number-of-people times number-of-requests performance. I finally ended up with a @@
match (contrib/tsquery) and a supporting GiST index, but that only supports exact match,
not prefix match.

Regards
Wolfgang Hamann
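[Editor's note, not part of the original thread] Since every wantcode here is a plain prefix, the join itself can be rewritten as a range join that avoids per-row regex evaluation entirely: roughly `code >= wantcode AND code < wantcode || <high sentinel>`. In PostgreSQL this is only index-friendly under the C collation or with a text_pattern_ops index, so treat it as a sketch rather than a drop-in fix. An illustration with stdlib sqlite3, the schema mimicking the thread's tables and the data invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE items (num INTEGER, code TEXT);
    CREATE INDEX itemsc ON items (code);
    CREATE TABLE n (wantcode TEXT);
""")
con.executemany("INSERT INTO items VALUES (?, ?)",
                [(1, "ABC3561A"), (2, "ABC3563X"), (3, "ABC3563B"), (4, "XYZ9")])
con.executemany("INSERT INTO n VALUES (?)", [("ABC3563",), ("XYZ",)])

# Prefix join expressed as a half-open range per wanted code; char(65535)
# produces U+FFFF, a sentinel sorting after any character used in the codes.
matches = con.execute("""
    SELECT num, wantcode
    FROM items JOIN n
      ON items.code >= n.wantcode AND items.code < n.wantcode || char(65535)
    ORDER BY num
""").fetchall()
print(matches)
```

A merge or index-nested-loop plan over a range condition like this touches only the matching slice of the index per wanted code, instead of re-testing every row with the regex engine.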
At 09:45 23/09/2011, hamann.w@t-online.de wrote:

>A single anchored query
>select * from items where code ~ '^ABC';
>does indeed use the index to retrieve data.
>
>So I wonder whether there might be a different approach to this problem rather than
>pattern matching. I recall I had a similar problem before with a "contacts" column
>possibly containing one or more email addresses. Here searches would also be number of
>people times number of requests performance. I finally ended up with a @@ match
>(contrib/tsquery) and a supporting GIST index, but that only supports exact match, not
>prefix

You can try these; I doubt they will use any index, but it's a different approach:

select * from items where length(items.code)<>length(rtrim(items.code,'ABC'));

select * from items where strpos(items.code,'ABC')=0 or strpos(items.code,'any_substring')=0;

HTH
Alban Hertroys <haramrae@gmail.com> wrote:
>> What is the output of explain?
>>
>> You say 'the other table', so presumably we're dealing with a foreign key
>> here. Is there an index on that column?

Albe Laurenz wrote:

>> Is the index used for "where code ~ '^ABC3563'"?
>>
>> If not, then the result is fast only because the table is scanned only once,
>> and it's just the factor of 3000 that's killing you.
>>
>> The second query (where code ~ wantcode) can never use an index because
>> the pattern "wantcode" is unknown at query planning time.
>>
>> Yours,
>> Laurenz Albe

Here I created a subset (just number and code matching a certain prefix):

\d items
        Table "pg_temp_1.items"
 Column |         Type          | Modifiers
--------+-----------------------+-----------
 num    | integer               |
 code   | character varying(40) |
create index itemsc on items (code);
select count(*) from items;
count
-------
9614
A single anchored query
select * from items where code ~ '^ABC';
does indeed use the index to retrieve data.
Next I copied a file of wanted codes
create temp table n (wantcode text);
\copy n from /tmp/rmartin.tmp
the file contains plain names, i.e. unanchored matches
explain analyze select num, n.wantcode from items, n where items.code ~ n.wantcode;
 Nested Loop  (cost=20.00..216502.14 rows=48070 width=36) (actual time=148.479..336280.488 rows=2871 loops=1)
   Join Filter: (("outer".code)::text ~ "inner".wantcode)
   ->  Seq Scan on items  (cost=0.00..167.14 rows=9614 width=42) (actual time=0.048..38.666 rows=9614 loops=1)
   ->  Materialize  (cost=20.00..30.00 rows=1000 width=32) (actual time=0.001..1.049 rows=815 loops=9614)
         ->  Seq Scan on n  (cost=0.00..20.00 rows=1000 width=32) (actual time=0.003..1.839 rows=815 loops=1)
 Total runtime: 336286.692 ms
So you're comparing a variable field value to a variable pattern - yeah, that's going to hurt. There's no way you could index exactly that.
Perhaps there's some way you can transform the problem so that you get something indexable?
For example, if your match patterns follow a certain pattern by themselves, you could add a column with the longest match pattern that would match the string. Then you could just do a query for which records have the match pattern (in that new column) that you're looking for and voila!
If something like that is possible strongly depends on what kind of match patterns you're using, of course.
An exact match "where items.code = n.wantcode" on the same data completes in 40 ms
That's an exact string match, of course that will be fast ;)
--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.
Eduardo Morras wrote:

>> You can try these, i doubt they will use any index but its a
>> different approach:
>>
>> select * from items where length(items.code)<>length(rtrim(items.code,'ABC'));
>>
>> select * from items where strpos(items.code,'ABC')=0 or
>> strpos(items.code,'any_substring')=0;

Hi,

if I understand this right, it does not mean "check if the string appears at position 0",
which could translate into an index query, but rather "check if the string appears
anywhere and then check if that is position 0", so the entire table is checked.

explain analyze select items.num, wantcode from items, n where strpos(code, wantcode) = 0;

 Nested Loop  (cost=167.14..196066.54 rows=39178 width=36) (actual time=0.074..36639.312 rows=7832539 loops=1)
   Join Filter: (strpos(("inner".code)::text, "outer".wantcode) = 0)
   ->  Seq Scan on n  (cost=0.00..14.15 rows=815 width=32) (actual time=0.005..2.212 rows=815 loops=1)
   ->  Materialize  (cost=167.14..263.28 rows=9614 width=42) (actual time=0.007..13.970 rows=9614 loops=815)
         ->  Seq Scan on items  (cost=0.00..167.14 rows=9614 width=42) (actual time=0.044..14.855 rows=9614 loops=1)
 Total runtime: 46229.836 ms

The query ran much faster than the pattern query, however. This seems to be the
performance of just searching for a plain string vs. initializing the regex engine every
time (for 815 queries in a test set).

Regards
Wolfgang Hamann
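[Editor's note, not part of the original thread] A side note on the strpos() semantics here: in PostgreSQL, strpos(string, substring) is 1-based and returns 0 when the substring is absent. So "strpos(code, wantcode) = 0" actually selects the rows that do NOT contain the wanted code at all, which is consistent with the 7832539 rows in the plan above (roughly 815 x 9614). The prefix test would be "strpos(code, wantcode) = 1". A quick illustration in Python:

```python
def strpos(string, sub):
    """Emulate PostgreSQL's strpos(): 1-based position, 0 when absent."""
    return string.find(sub) + 1

codes = ["ABC3561A", "ABC3563X", "72-451-823"]

# "= 0" keeps the rows where the wanted code does NOT occur at all...
not_containing = [c for c in codes if strpos(c, "ABC3563") == 0]
# ...while "= 1" is the actual prefix test.
prefixed = [c for c in codes if strpos(c, "ABC3563") == 1]
```

So the 46-second run above was scanning for non-matches, and its speed advantage over the regex join says nothing yet about the prefix problem itself.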
Alban Hertroys wrote:

>> So you're comparing a variable field value to a variable pattern - yeah,
>> that's going to hurt. There's no way you could index exactly that.
>>
>> Perhaps there's some way you can transform the problem so that you get
>> something indexable?
>> For example, if your match patterns follow a certain pattern by themselves,
>> you could add a column with the longest match pattern that would match the
>> string. Then you could just do a query for which records have the match
>> pattern (in that new column) that you're looking for and voila!
>>
>> If something like that is possible strongly depends on what kind of match
>> patterns you're using, of course.

Hi Alban,

I already did that - the test set is just all records from the real table (about a
million entries) that match the common 'ABC' prefix.

>>> An exact match "where items.code = n.wantcode" on the same data completes
>>> in 40 ms

>> That's an exact string match, of course that will be fast ;)

The main difference is: the fast query looks like

explain select items.num, wantcode from items, n where code = wantcode;
 Merge Join  (cost=53.56..1104.02 rows=39178 width=36)
   Merge Cond: (("outer".code)::text = "inner".wantcode)
   ->  Index Scan using itemsc on items  (cost=0.00..438.75 rows=9614 width=42)
   ->  Sort  (cost=53.56..55.60 rows=815 width=32)
         Sort Key: n.wantcode
         ->  Seq Scan on n  (cost=0.00..14.15 rows=815 width=32)

and the slow one looks like this:

 Nested Loop  (cost=14.15..176478.01 rows=39178 width=36)
   Join Filter: (("outer".code)::text ~ "inner".wantcode)

So the database takes an entirely different approach at retrieving the entries.

Regards
Wolfgang Hamann
On 23 September 2011 14:29, <hamann.w@t-online.de> wrote:

> Alban Hertroys wrote:
>
>>> So you're comparing a variable field value to a variable pattern - yeah,
>>> that's going to hurt. There's no way you could index exactly that.
>>>
>>> Perhaps there's some way you can transform the problem so that you get
>>> something indexable?
>>> For example, if your match patterns follow a certain pattern by themselves,
>>> you could add a column with the longest match pattern that would match the
>>> string. Then you could just do a query for which records have the match
>>> pattern (in that new column) that you're looking for and voila!
>>>
>>> If something like that is possible strongly depends on what kind of match
>>> patterns you're using, of course.
>
> Hi Alban,
>
> I already did that - the test set is just all records from the real table (about a million
> entries) that match the common 'ABC' prefix

I think you misunderstood what I wrote. Notice the difference between "which strings
match the pattern" and "which records have the match pattern (in that new column)" - the
first is a regular expression match (unindexable), while the second is a string equality
match (indexable).

What I'm suggesting is to add a column, which for the string 'ABCDEFG' would contain
'ABC%'. Data would look like:

SELECT str, pattern FROM tbl;
   str   | pattern
---------+---------
 ABCDEFG | ABC%
 ABCDEF  | ABC%
 BCDEFGH | BCD%
 etc.

When you look for records that match the pattern 'ABC%', you would normally perform a
query like:

SELECT str FROM tbl WHERE str LIKE 'ABC%';

But with this new column, you would query:

SELECT str FROM tbl WHERE pattern = 'ABC%';

As I said, it depends a lot on your pattern needs whether this solution would work at
all for you. If you only ever use a few patterns, it will work. If you use many
different patterns or don't know before-hand which patterns will be used, it won't work
well at all.
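[Editor's note, not part of the original thread] Alban's precomputed-pattern idea boils down to trading a per-row regex for an equality lookup on a derived column. A minimal sketch in Python; the fixed three-character-prefix rule is an assumption for illustration, since a real rule would depend on the manufacturers' numbering schemes:

```python
from collections import defaultdict

def pattern_of(s):
    # Assumed derivation rule: the pattern is the first three characters
    # plus a LIKE wildcard. Real schemes would need a smarter rule.
    return s[:3] + "%"

rows = ["ABCDEFG", "ABCDEF", "BCDEFGH"]

# Precompute once; this dict plays the role of an index on the new column.
by_pattern = defaultdict(list)
for s in rows:
    by_pattern[pattern_of(s)].append(s)

# Answering "which records have pattern 'ABC%'" is now an equality lookup,
# with no per-row pattern matching at query time.
print(by_pattern["ABC%"])  # -> ['ABCDEFG', 'ABCDEF']
```

The cost of deriving the pattern is paid once per row at write time instead of once per row per query, which is exactly the trade Alban describes.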
> The main difference is: the fast query looks like
>
> explain select items.num, wantcode from items, n where code = wantcode;
>  Merge Join  (cost=53.56..1104.02 rows=39178 width=36)
>    Merge Cond: (("outer".code)::text = "inner".wantcode)
>    ->  Index Scan using itemsc on items  (cost=0.00..438.75 rows=9614 width=42)
>    ->  Sort  (cost=53.56..55.60 rows=815 width=32)
>          Sort Key: n.wantcode
>          ->  Seq Scan on n  (cost=0.00..14.15 rows=815 width=32)

Is there an index on wantcode? If you have a million or more records, I would expect an
index scan for a measly 815 matches...

> and the slow ones looks like that one:
>
>  Nested Loop  (cost=14.15..176478.01 rows=39178 width=36)
>    Join Filter: (("outer".code)::text ~ "inner".wantcode)
>
> So the database takes an entirely different approach at retrieving the entries.

Yes, because you're still using ~ there, with a pattern that's unknown at query planning
time. That will only be fast under some fairly rare circumstances.

--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.
>> I think you misunderstood what I wrote. Notice the difference between
>> "which strings match the pattern" and "which records have the match
>> pattern (in that new column)" - the first is a regular expression
>> match (unindexable), while the second is a string equality match
>> (indexable).
>>
>> What I'm suggesting is to add a column, which for the string 'ABCDEFG'
>> would contain 'ABC%'. Data would look like:
>>
>> SELECT str, pattern FROM tbl;
>>    str   | pattern
>> ---------+---------
>>  ABCDEFG | ABC%
>>  ABCDEF  | ABC%
>>  BCDEFGH | BCD%
>> etc.
>>
>> When you look for records that match the pattern 'ABC%', you would
>> normally perform a query like:
>>
>> SELECT str FROM tbl WHERE str LIKE 'ABC%';
>>
>> But with this new column, you would query:
>>
>> SELECT str FROM tbl WHERE pattern = 'ABC%';
>>
>> As I said, it depends a lot on your pattern needs whether this
>> solution would work at all for you. If you only ever use a few
>> patterns, it will work. If you use many different patterns or don't
>> know before-hand which patterns will be used, it won't work well at all.
>>
>> Is there an index on wantcode? If you have a million or more records,
>> I would expect an index scan for a measly 815 matches...

Hi,

the problem is that I read the patterns from a file, as part of the script. They are not
known separately. So it seems that creating the extra column is just the same effort as
running the original query. The processing is one-time in nature. The one thing I can do
is selecting a range of items on a common prefix, if all the codes in the second table
have some characters in common.

When I ran a test, there was no real difference with wantcode indexed or not.

It was interesting to try another suggestion and note the difference between comparison
functions, with identical (lack of) use of an index.

Regards
Wolfgang Hamann
> explain analyze select num, n.wantcode from items, n where items.code ~ n.wantcode;
>  Nested Loop  (cost=20.00..216502.14 rows=48070 width=36) (actual time=148.479..336280.488 rows=2871 loops=1)
>    Join Filter: (("outer".code)::text ~ "inner".wantcode)
>    ->  Seq Scan on items  (cost=0.00..167.14 rows=9614 width=42) (actual time=0.048..38.666 rows=9614 loops=1)
>    ->  Materialize  (cost=20.00..30.00 rows=1000 width=32) (actual time=0.001..1.049 rows=815 loops=9614)
>          ->  Seq Scan on n  (cost=0.00..20.00 rows=1000 width=32) (actual time=0.003..1.839 rows=815 loops=1)
>  Total runtime: 336286.692 ms

>> So you're comparing a variable field value to a variable pattern - yeah, that's going
>> to hurt. There's no way you could index exactly that.
>>
>> Perhaps there's some way you can transform the problem so that you get something
>> indexable?
>> For example, if your match patterns follow a certain pattern by themselves, you could
>> add a column with the longest match pattern that would match the string. Then you
>> could just do a query for which records have the match pattern (in that new column)
>> that you're looking for and voila!
I've only been partially following this thread, but did you try something like:

WHERE items.code ~ ('^' || n.wantcode)

Not sure if this will be valid for your needs, but the issue is that PostgreSQL cannot
rely on an index for non-anchored search patterns, and your compare-to data rightly does
not contain regex meta-characters. If you explicitly indicate that the input expression
is going to be anchored, would PostgreSQL then realize it can use the index?

Not tested.
David J.
At 14:12 23/09/2011, hamann.w@t-online.de wrote:

>Eduardo Morras wrote:
>
>>> You can try these, i doubt they will use any index but its a
>>> different approach:
>>>
>>> select * from items where length(items.code)<>length(rtrim(items.code,'ABC'));
>>>
>>> select * from items where strpos(items.code,'ABC')=0 or
>>> strpos(items.code,'any_substring')=0;
>
>Hi,
>
>if I understand this right, it does not mean "check if the string appears at position 0"
>which could translate into an index query, but rather "check if the string appears
>anywhere and then check if that is position 0", so the entire table is checked.

The second one yes; as it checks all the patterns you want only one time per row, it
only needs one table scan. The first one eliminates the substring 'ABC' from the string:
if the lengths of both strings are equal, the substring 'ABC' wasn't in it; if they are
different, the trimmed string will be shorter.

>explain analyze select items.num, wantcode from items, n where
>strpos(code, wantcode) = 0;
> Nested Loop  (cost=167.14..196066.54 rows=39178 width=36) (actual time=0.074..36639.312 rows=7832539 loops=1)
>   Join Filter: (strpos(("inner".code)::text, "outer".wantcode) = 0)
>   ->  Seq Scan on n  (cost=0.00..14.15 rows=815 width=32) (actual time=0.005..2.212 rows=815 loops=1)
>   ->  Materialize  (cost=167.14..263.28 rows=9614 width=42) (actual time=0.007..13.970 rows=9614 loops=815)
>         ->  Seq Scan on items  (cost=0.00..167.14 rows=9614 width=42) (actual time=0.044..14.855 rows=9614 loops=1)
> Total runtime: 46229.836 ms
>
>The query ran much faster than the pattern query, however. This seems to be the
>performance of just searching for a plain string vs. initializing the regex engine
>every time (for 815 queries in a test set)

It will do only one table scan, while your original code will do one for each substring
you want to test. You can add more and more substrings without too much cost.
If you want to use the regex engine instead of the PostgreSQL string functions, check
regexp_matches(); it should be faster if you have 3000 substrings.

select * from items where regexp_matches(items.code,'(ABC) (DE1) (any_substring)')<>'{}';

>Regards
>Wolfgang Hamann

HTH
Eduardo Morras wrote:

>> The second one yes, as it checks all patterns you want only one time
>> per row they only needs one table scan. The first one eliminates the
>> substring 'ABC' from the string, if the lengths of both strings are
>> equal, the substring 'ABC' wasn't in it. If they are different, the
>> trimmed string will be shorter.
>>
>> It will do only one table scan while your original code will do one
>> for each substring you want to test. You can add more and more
>> substrings without too much cost. If you want to use the regex engine
>> instead the postgresql string functions check the regexp_matches(), it
>> should be faster if you have 3000 substrings.
>>
>> select * from items where regexp_matches(items.code,'(ABC) (DE1)
>> (any_substring)')<>{};

Hi Eduardo,

it is clear that scanning the table once with a list of matches will outperform
rescanning the table for every string wanted. Now, my problem is that the patterns are
dynamic as well. So if I could translate a table with one column and a few thousand rows
into something like
regexp_matches(code,'string1|string2|.....string2781')
that would ideally be a performant query. Unfortunately I have no idea how I could
achieve this transformation inside the database. Doing it externally fails, because any
single query cannot be more than so few characters.

Regards
Wolfgang Hamann
On 25 Sep 2011, at 8:04, hamann.w@t-online.de wrote:

> Hi Eduardo,
>
> it is clear that scanning the table once with a list of matches will outperform
> rescanning the table for every string wanted. Now, my problem is that the patterns are
> dynamic as well. So if I could translate a table with one column and a few thousand rows
> into something like
> regexp_matches(code,'string1|string2|.....string2781')
> would ideally be a performant query. Unfortunately I have no idea how I could achieve this
> transformation inside the database. Doing it externally fails, because any single query cannot
> be more than so few characters.

To me it sounds a little bit like you're comparing every item in a warehouse to a set of
descriptions to see what type of item it is, which is something you would be much better
off storing as a property of the item. If an item is a fruit, store that it's a fruit!
But I'm guessing at what you're trying to accomplish, so here's a few other options...

I guess you could create 2781 expression indexes to do what you want (is there a limit
that prevents this?). Query planning would probably become kind of slow and the indices
will take up a considerable fraction of the total table storage required - that's a
pretty outlandish approach.

CREATE INDEX tbl_str_regex1 ON tbl (regexp_matches(str, 'string1'));
CREATE INDEX tbl_str_regex2 ON tbl (regexp_matches(str, 'string2'));
...
CREATE INDEX tbl_str_regex2781 ON tbl (regexp_matches(str, 'string2781'));

Or are you really going to query every record against all 2781 regexes? I can't figure
out a realistic scenario why you (or anyone) would want that. In that case those indices
aren't going to help you much, as the planner would have to hold every record in tbl to
each index - it won't do that.

You could also create a giant lookup table (a materialized view, if you like) where the
result of every match of str in tbl against the wantcode in the regex table is stored.
That's some huge overhead, but it will probably outperform most other options. With the
numbers you gave, that table will hold about 2-3 billion records with two foreign key
values and a truth value each.

Alban Hertroys

--
The scale of a problem often equals the size of an ego.
At 08:04 25/09/2011, hamann.w@t-online.de wrote:

>>> select * from items where regexp_matches(items.code,'(ABC) (DE1)
>>> (any_substring)')<>{};
>
>Hi Eduardo,
>
>it is clear that scanning the table once with a list of matches will outperform
>rescanning the table for every string wanted. Now, my problem is that the patterns are
>dynamic as well. So if I could translate a table with one column and a few thousand rows
>into something like
>regexp_matches(code,'string1|string2|.....string2781')
>would ideally be a performant query. Unfortunately I have no idea how I could achieve
>this transformation inside the database. Doing it externally fails, because any single
>query cannot be more than so few characters.

You can create a plpgsql function and pass it a setof text that does it. Sorry, but
instead of telling you What Must You Type, WMYT(c), I prefer the How Should You Do way,
HSYD(c). Note that you can get the same results using other approaches (f.ex. using FTS
described in chapter 12).

Check these topics:

Function creation
http://www.postgresql.org/docs/9.0/static/sql-createfunction.html

Tutorial about function creation
http://www.adderpit.com/practical-postgresql/x10374.htm

HTH
You forgot to include the list ;)

On 26 Sep 2011, at 6:06, hamann.w@t-online.de wrote:

> Alban Hertroys wrote:
>> To me it sounds a little bit like you're comparing every item in a
>> warehouse to a set of descriptions to see what type of item it is, which
>> is something you would be much better off storing as a property of the
>> item. If an item is a fruit, store that it's a fruit!
>> But I'm guessing at what you're trying to accomplish, so here's a few
>> other options...
>>
>> I guess you could create 2781 expression indexes to do what you want (is
>> there a limit that prevents this?). Query planning would probably become
>> kind of slow and the indices will take up a considerable fraction of the
>> total table storage required - that's a pretty outlandish approach.
>>
>> CREATE INDEX tbl_str_regex1 ON tbl (regexp_matches(str, 'string1'));
>> CREATE INDEX tbl_str_regex2 ON tbl (regexp_matches(str, 'string2'));
>> ...
>> CREATE INDEX tbl_str_regex2781 ON tbl (regexp_matches(str, 'string2781'));
>
> Hi,
>
> the strings are not really known before.
> Let me explain the scenario: there is one table about products, and code is the
> manufacturer's (or reseller's) product id.
> So, if ABC were a maker of laptops, ABC123 and ABC456 might be two different
> machines, and ABC123G might have G3 mobile installed, or ABC123X might be the
> same thing with extra memory. Obviously these device variants all look the same.
> Now a reseller sends us a collection of product images, so there would be
> ABC123.jpg and ABC456.jpg.
> The database task at hand is matching products to images (and then inserting
> the image name into a column of the products table).

I guessed right then. The origin of your problem is that you have similar items in
your database, but the database doesn't know they are similar. I'd fix that first;
it makes the problem a whole lot easier to handle.
For example, if an image comes in named ABC123G.jpg, you look up the product and
manufacturer and update its image. Then you query for products of the same
manufacturer that are similar to ABC123G (result: ABC123 and ABC123X) and update
their images as well (if appropriate; perhaps they have a recent enough image of
their own?).

As another whacky alternative to your regular expressions: I think it would be
possible to abuse the text-search functionality in Postgres to match product id's.
Those id's are basically a language per manufacturer describing product details.

If you can split the product id's up into lexemes that describe the base product id
and its options, then you can use full-text search to match up expressions similar
to the lexemes derived from the image name.

For example:

 productid | lexemes
-----------+-------------------
 ABC123    | {'ABC' '123'}
 ABC123G   | {'ABC' '123' 'G'}
 ABC123X   | {'ABC' '123' 'X'}
 ABC456    | {'ABC' '456'}

I'm not really sure if that's possible, or how much work it would be per
manufacturer - I haven't used FTS much.

I'd first see if I couldn't add that similarity information to the products table,
though ;)

Alban Hertroys

--
The scale of a problem often equals the size of an ego.
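A hypothetical sketch of that FTS idea, for the record. The splitting regexp, the
column name code_ts, and the single-manufacturer code format are all assumptions;
real manufacturers would each need their own splitting rule:

```sql
-- Derive lexemes from codes like ABC123G -> 'ABC 123 G' and index them.
ALTER TABLE products ADD COLUMN code_ts tsvector;

UPDATE products
   SET code_ts = to_tsvector('simple',
         regexp_replace(code, '^([A-Z]+)([0-9]+)([A-Z]*)$', '\1 \2 \3'));

CREATE INDEX products_code_ts ON products USING gin (code_ts);

-- Match the lexemes derived from an image name (ABC123.jpg) against products;
-- this would find ABC123, ABC123G and ABC123X but not ABC456.
SELECT code
  FROM products
 WHERE code_ts @@ to_tsquery('simple', 'ABC & 123');
```

The 'simple' configuration is used deliberately so no stemming or stop-word
processing mangles the code fragments.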
Eduardo Morras <nec556@retena.com> wrote:
>> You can create a plsql function and pass a setof text that do it.
>> Sorry but instead saying you What Must You Type, WMYT(c), i prefer
>> the How Should You Do way, HSYD(c). Note that you can get the same
>> results using other approachs (f.ex. using FTS described in chapter 12)
>>
>> Check this topics:
>>
>> Function Creation
>> http://www.postgresql.org/docs/9.0/static/sql-createfunction.html
>>
>> Tutorial about Function Creation
>> http://www.adderpit.com/practical-postgresql/x10374.htm

Hi,

I tried the pl/sql approach to convert the contents of that patterns table into a
regex. Results: 40 seconds runtime for 9500 candidates and 815 patterns;
718 seconds for the same set of 9500 candidates, but using 4000 patterns instead.
So it seems that I am reaching the limits of pattern matching.

As for the fulltext index (and the underlying tsquery): this is an exact match
rather than a prefix match, so I would need to know the match patterns in advance
in order to build the index.

I am thinking about that anyway (because ABC1234 likely should not match the ABC123
pattern in my context), but I would sort of prefer a system where I can state the
rules when I see the data set, rather than having to pre-create an index.

Thanks for the tutorial link :)
It seems that the responses to my post give all sorts of input that will help me
on other tasks.

Regards
Wolfgang Hamann
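For the pure prefix case there is one more rewrite worth noting: express "code starts
with wantcode" as a btree range condition, so the existing index can be probed once
per wanted row instead of matching a regex. This is only a sketch with the table
names assumed from the thread, and it presumes an index whose sort order is byte-wise
(the C locale, or an index created with text_pattern_ops and matching operators);
the appended chr(255) is a crude upper bound:

```sql
-- Prefix match as a range scan: for each wanted code, an index range probe
-- on items.code replaces a sequential regex scan.
SELECT i.code
FROM wanted w
JOIN items i
  ON i.code >= w.wantcode
 AND i.code <  w.wantcode || chr(255);
```

With 3000 wanted codes this turns the nested loop into 3000 cheap index range
scans rather than 3000 full passes over the table.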
Alban Hertroys <haramrae@gmail.com> wrote:
>> I guessed right then. The origin of your problem is that you have similar items
>> in your database, but the database doesn't know they are similar. I'd fix that
>> first, it makes the problem a whole lot easier to handle.
>>
>> For example, if an image comes in named ABC123G.jpg, you look up the product and
>> manufacturer and update its image. Then you query for products of the same
>> manufacturer that are similar to ABC123G (result: ABC123 and ABC123X) and update
>> their images as well (if appropriate; perhaps they have a recent enough image of
>> their own?).
>>
>> As another whacky alternative to your regular expressions; I think it would be
>> possible to abuse the text-search functionality in Postgres to match product
>> id's. Those id's are basically a language per manufacturer describing product
>> details.
>>
>> If you can split the product id's up into lexemes that describe the base product
>> id and its options, then you can use full-text search to match up expressions
>> similar to the lexemes derived from the image name.
>>
>> For example:
>>  productid | lexemes
>> -----------+-------------------
>>  ABC123    | {'ABC' '123'}
>>  ABC123G   | {'ABC' '123' 'G'}
>>  ABC123X   | {'ABC' '123' 'X'}
>>  ABC456    | {'ABC' '456'}
>>
>> I'm not really sure if that's possible, or how much work it would be per
>> manufacturer - I haven't used FTS much.
>>
>> I'd first see if I couldn't add that similarity information to the products
>> table, though ;)
>>
>> Alban Hertroys

Hi,

the actual process tends to be:
- add products to the database
- then receive images and try to match them to products.

So I know about the naming scheme only when I see a list of files, and would want
to have a cmdline option for my matching script that distinguishes formats like
a) exact
b) alpha suffix following numeric main body
c) period or slash between main part and related part

To make things even more complex, I might receive images from a reseller that
offers a few manufacturers using different conventions.
I wonder whether this would translate well into building a temporary index, if I
detect b or c patterns.

When I asked first, I also started working on a different approach. This did work
on the current case (I got similar datasets before, but with considerably fewer
items), so I will probably try that plus some ideas I got from this discussion,
and see how far I get.
That approach is a separate perl script that builds a tree structure of all the
image names and then tries to create a regex for a crude filter. In the example it
would have determined that all images in the lot match an ABC prefix. Then it
selects all matching codes (so it can work with the entire database) and runs them
through the same tree structure.

Regards
Wolfgang Hamann
At 18:18 26/09/2011, you wrote:
>I tried the pl/sql approach to convert the contents of that patterns
>table into a regex.
>Results: 40 seconds runtime for 9500 candidates and 815 patterns
>718 seconds for the same set of 9500 candidates, but using 4000
>patterns instead.
>So it seems that I am reaching limits of pattern match

Perhaps calling the function twice, with half the values each time, would go
faster.

How do you call the function, with EXECUTE or SELECT? If you use EXECUTE, the plan
prepared in a previous call is ignored and the statement is re-planned, which is
usually faster here. I don't know if it runs faster in your case, but you can
try it.

>As for the fulltext index (and the underlying tsquery): this is an
>exact match rather than prefix match, so I would need to know match
>patterns in advance in order to build the index
>
>I am thinking about that anyway (because ABC1234 likely should not
>match ABC123 pattern in my context), but I would sort of prefer a
>system where I can state the rules when I see the data set, rather
>than having to pre-create an index.
>
>Thanks for the tutorial link :)

It's the one I used some time ago. A bit old, but a very good one.

>It seems that the responses on my post give all sorts of input that
>will help me on other tasks
>
>Regards
>Wolfgang Hamann
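A minimal sketch of what the EXECUTE variant could look like; the function name,
return shape and the items table are assumptions, not code from the thread. Inside
pl/pgsql, RETURN QUERY EXECUTE ... USING (available since 8.4) plans the statement
with the concrete regex on every call instead of reusing a cached generic plan:

```sql
CREATE OR REPLACE FUNCTION match_codes(re text)
RETURNS SETOF text AS $$
BEGIN
  -- EXECUTE builds and plans the query anew each call, with the actual
  -- pattern value available to the planner.
  RETURN QUERY EXECUTE
    'SELECT code FROM items WHERE code ~ $1'
    USING re;
END;
$$ LANGUAGE plpgsql;

-- Called with the alternation built from the patterns table:
-- SELECT * FROM match_codes('^(ABC3563|ABC3561|...)');
```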