Thread: Indexing and regular expressions

Indexing and regular expressions

From

Kjartan Ásþórsson

Date:

07 April 2002, 06:09:21

Is there any indexing technique available I can use when joining tables
with a regular expression pattern in pgsql?

I know one method for indexing strings that will be matched with regular
expression patterns, and that is using so called k-gram indexes.
Indexing the string "kjartan" with k-gram index where k = 3 would
create "kja", "jar", "art", "rta", "tan" as an index. Ofcourse it is hard to
decide the size of k and I'm sure in many cases mulitple k values might
be needed, depending on the situation.

I have not done any major survey of available techniques, but I was
hoping I could get some pointers here.

I assume pgsql always uses nested loop join when joining relations which are
joined with regular expression pattern?
-- 
Kjartan Ásþórsson
http://www.kjarri.net 
Tel: +46 (0)730 556705

Re: Indexing and regular expressions

From

Oleg Bartunov

Date:

07 April 2002, 06:54:03

On Sun, 7 Apr 2002, [ISO-8859-1] Kjartan аsЧСrsson wrote:

> Is there any indexing technique available I can use when joining tables
> with a regular expression pattern in pgsql?
>
> I know one method for indexing strings that will be matched with regular
> expression patterns, and that is using so called k-gram indexes.
> Indexing the string "kjartan" with k-gram index where k = 3 would
> create "kja", "jar", "art", "rta", "tan" as an index. Ofcourse it is hard to

Usually, k-grams technique is used to match patterns with errors and
3-grams produce "__k", "_kj", "kja", "jar", "art", "rta", "ta_", "a__"
where leading and trailing spaces are used to compensate 'boundary' effect.

But I dont' quite understand your question. Are you looking for fuzzy match ?
If so, take a look on contrib modules.

> decide the size of k and I'm sure in many cases mulitple k values might
> be needed, depending on the situation.
>
> I have not done any major survey of available techniques, but I was
> hoping I could get some pointers here.
>
> I assume pgsql always uses nested loop join when joining relations which are
> joined with regular expression pattern?
>
Regards,    Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: Indexing and regular expressions

From

Kjartan Бsюуrsson

Date:

07 April 2002, 07:48:45

Thank you for your reply.

No, I am not looking for a fuzzy match. I am simply wondering if there
are some methods available that can speed up joining of tables when
the join is done with a regular expression operator (one table
contains regular expression patterns, and the other strings that
should be matched against the pattern).

If no indexes are used, a full table scan of the string table is
needed if we want to select all strings that matches a given pattern.

My question is if there are any indexing methods available that can
be used to prone the search space in the string table when using
the regular expression operator.

I hope I made myself more clear now. Best regards,
Kjartan


Sunday, April 7, 2002, 12:54:27 PM, you wrote:

> On Sun, 7 Apr 2002, [ISO-8859-1] Kjartan аsЧСrsson wrote:

>> Is there any indexing technique available I can use when joining tables
>> with a regular expression pattern in pgsql?
>>
>> I know one method for indexing strings that will be matched with regular
>> expression patterns, and that is using so called k-gram indexes.
>> Indexing the string "kjartan" with k-gram index where k = 3 would
>> create "kja", "jar", "art", "rta", "tan" as an index. Ofcourse it is hard to

> Usually, k-grams technique is used to match patterns with errors and
> 3-grams produce "__k", "_kj", "kja", "jar", "art", "rta", "ta_", "a__"
> where leading and trailing spaces are used to compensate 'boundary' effect.

> But I dont' quite understand your question. Are you looking for fuzzy match ?
> If so, take a look on contrib modules.

>> decide the size of k and I'm sure in many cases mulitple k values might
>> be needed, depending on the situation.
>>
>> I have not done any major survey of available techniques, but I was
>> hoping I could get some pointers here.
>>
>> I assume pgsql always uses nested loop join when joining relations which are
>> joined with regular expression pattern?
>>

>         Regards,
>                 Oleg
> _____________________________________________________________
> Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
> Sternberg Astronomical Institute, Moscow University (Russia)
> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> phone: +007(095)939-16-83, +007(095)939-23-83


> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

Re: Indexing and regular expressions

From

Neil Conway

Date:

07 April 2002, 15:08:33

On Sun, 7 Apr 2002 13:49:08 +0200
"Kjartan " <a98kjaas@student.his.se> wrote:
> Thank you for your reply.
> 
> No, I am not looking for a fuzzy match. I am simply wondering if there
> are some methods available that can speed up joining of tables when
> the join is done with a regular expression operator (one table
> contains regular expression patterns, and the other strings that
> should be matched against the pattern).
> 
> If no indexes are used, a full table scan of the string table is
> needed if we want to select all strings that matches a given pattern.

If the pattern is arbitrary, I'm not sure that any indexing technique
will be able to significantly improve performance (or at least, I'd
be *very* interested to see such a technique).

If you're storing the regular expressions in another table, you
could also store the pre-compiled patterns and use those for a
seqscan, but that won't net you a large improvement, particularly
if your set of patterns is much smaller than your set of strings
to match against (in which case the time to compile the pattern
becomes insignificant).

If your regular expressions contain common sub-elements (e.g.
many of them include "match string beginning with xxx" or whatever),
you could perhaps use indexes to optimize those sub-elements,
and then run the rest of the pattern on the tuples found by the
index. But if your patterns are truly arbitrary, this will be
unlikely to help.

Therefore, the answer is no AFAICT -- regular expressions are too
flexible to allow for optimization in advance.

Cheers,

Neil

-- 
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC