Thread: text_position worst case runtime

text_position worst case runtime

From
Mark Dilger
Date:
The function
 static int32 text_position(text *t1, text *t2, int matchnum)

defined in src/backend/utils/adt/varlena.c uses repeated calls to strncmp (or
pg_wchar_strncmp) to find the location of the pattern in the text.  The worst
case runtime for such an approach is O(n*m) where n and m are the lengths of the
pattern and text.  The best case would be O(n), I guess, because it only takes n
comparisons to find the pattern at the very start of the text.  I'm not sure how
to determine the average case runtime, because it depends what your data looks
like on average.

The fastest worst-case algorithmic solutions (boyer-moore and similar) have an
O(n+m) worst-case runtime.

In practice, the current iterative strncmp solution may be faster than the
boyer-moore solution for average data, because the data may not be constructed
in such a way as to trigger worst-case or nearly worst-case run times.  The
current solution also has the advantage of not requiring a palloc of O(n) space
for the pattern preprocessing that boyer-moore requires.

My question is, would the core development team entertain the idea of changing
this function if I supplied the new code?

Thanks,

mark


Re: text_position worst case runtime

From
Tom Lane
Date:
Mark Dilger <pgsql@markdilger.com> writes:
> The function
>   static int32 text_position(text *t1, text *t2, int matchnum)
> defined in src/backend/utils/adt/varlena.c uses repeated calls to
> strncmp (or pg_wchar_strncmp) to find the location of the pattern in
> the text.  The worst case runtime for such an approach is O(n*m) where
> n and m are the lengths of the pattern and text.  The best case would
> be O(n), I guess, because it only takes n comparisons to find the
> pattern at the very start of the text.  I'm not sure how to determine
> the average case runtime, because it depends what your data looks like
> on average.

I would think that the worst-case times would be fairly improbable.
I'm disinclined to push something as complicated as Boyer-Moore matching
into this function without considerable evidence that it's a performance
bottleneck for real applications.

The question that comes to mind for me is why we're not using simple
strncmp in all cases in that code.  The conversion to pg_wchar format
looks expensive and unnecessary ...
        regards, tom lane


Re: text_position worst case runtime

From
Mark Dilger
Date:
Tom Lane wrote:
> Mark Dilger <pgsql@markdilger.com> writes:
> 
>>The function
>>  static int32 text_position(text *t1, text *t2, int matchnum)
>>defined in src/backend/utils/adt/varlena.c uses repeated calls to
>>strncmp (or pg_wchar_strncmp) to find the location of the pattern in
>>the text.  The worst case runtime for such an approach is O(n*m) where
>>n and m are the lengths of the pattern and text.  The best case would
>>be O(n), I guess, because it only takes n comparisons to find the
>>pattern at the very start of the text.  I'm not sure how to determine
>>the average case runtime, because it depends what your data looks like
>>on average.
> 
> 
> I would think that the worst-case times would be fairly improbable.
> I'm disinclined to push something as complicated as Boyer-Moore matching
> into this function without considerable evidence that it's a performance
> bottleneck for real applications.

A common approach in biological data applications is to store nucleic and amino
acid sequences as text in a relational database.  The smaller alphabet sizes and
the tendency for redundancy in these sequences increases the likelihood of a
performance problem.  I have solved this problem by writing my own data types
with their own functions for sequence comparison and alignment, and I used
boyer-moore for some of that work.  Whether the same technique should be used
for the text and varchar types was unclear to me, hence the question.

> The question that comes to mind for me is why we're not using simple
> strncmp in all cases in that code.  The conversion to pg_wchar format
> looks expensive and unnecessary ...


Re: text_position worst case runtime

From
"Jim C. Nasby"
Date:
On Thu, May 18, 2006 at 06:49:38PM -0700, Mark Dilger wrote:
> > I would think that the worst-case times would be fairly improbable.
> > I'm disinclined to push something as complicated as Boyer-Moore matching
> > into this function without considerable evidence that it's a performance
> > bottleneck for real applications.
> 
> A common approach in biological data applications is to store nucleic and amino
> acid sequences as text in a relational database.  The smaller alphabet sizes and
> the tendency for redundancy in these sequences increases the likelihood of a
> performance problem.  I have solved this problem by writing my own data types
> with their own functions for sequence comparison and alignment, and I used
> boyer-moore for some of that work.  Whether the same technique should be used
> for the text and varchar types was unclear to me, hence the question.

Perhaps it would be best to add a seperate set of functions that use
boyer-moore, and reference them in appropriate places in the
documentation. Unless someone has a better idea on how we can find out
what people are actually doing in the field...
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: text_position worst case runtime

From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> Perhaps it would be best to add a seperate set of functions that use
> boyer-moore, and reference them in appropriate places in the
> documentation. Unless someone has a better idea on how we can find out
> what people are actually doing in the field...

You've obviously missed the point of my concern, which is code bloat.
A parallel set of functions incorporating B-M would make things worse
not better from that standpoint.  (Unless you are proposing that someone
do it as a separate pgfoundry project; which'd be fine with me.  I'm
just concerned about how much we buy into as core features.)
        regards, tom lane


Re: text_position worst case runtime

From
Alvaro Herrera
Date:
Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > Perhaps it would be best to add a seperate set of functions that use
> > boyer-moore, and reference them in appropriate places in the
> > documentation. Unless someone has a better idea on how we can find out
> > what people are actually doing in the field...
> 
> You've obviously missed the point of my concern, which is code bloat.
> A parallel set of functions incorporating B-M would make things worse
> not better from that standpoint.  (Unless you are proposing that someone
> do it as a separate pgfoundry project; which'd be fine with me.  I'm
> just concerned about how much we buy into as core features.)

So why not just replace our code with better algorithms?  We could use
Shift-Or or Shift-And which AFAIK are even better than Boyer-Moore.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: text_position worst case runtime

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> You've obviously missed the point of my concern, which is code bloat.

> So why not just replace our code with better algorithms?  We could use
> Shift-Or or Shift-And which AFAIK are even better than Boyer-Moore.

And how much code would those take?  The bottom line here is that we
don't have a pile of complaints about the performance of text_position,
so it's difficult to justify making it much more complicated than it
is now.
        regards, tom lane


Re: text_position worst case runtime

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Tom Lane wrote:
> >> You've obviously missed the point of my concern, which is code bloat.
> 
> > So why not just replace our code with better algorithms?  We could use
> > Shift-Or or Shift-And which AFAIK are even better than Boyer-Moore.
> 
> And how much code would those take?  The bottom line here is that we
> don't have a pile of complaints about the performance of text_position,
> so it's difficult to justify making it much more complicated than it
> is now.

Even Boyer-Moore, while conceptually tricky isn't actually all that much code.

It seems somewhat contrary to the Postgres design philosophy to assume that
all strings are small.

Other databases have two different string data types, one that has a small
length limit (often only 255 bytes or so) and another that has all kinds of
awkward restrictions on how it can be used. Postgres allows text to contain
gigabytes of data and lets you use all the normal string functions on it. 

It seems like having those string functions assuming the strings are small
compromises that design choice.

-- 
greg



Re: text_position worst case runtime

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> And how much code would those take?  The bottom line here is that we
>> don't have a pile of complaints about the performance of text_position,
>> so it's difficult to justify making it much more complicated than it
>> is now.

> It seems somewhat contrary to the Postgres design philosophy to assume that
> all strings are small.

That is a straw-man argument.  If we try to optimize every single
function in the system to the Nth degree, we'll end up with a system
that is unmaintainable (and likely unusably buggy as well).  We've got
to set limits on the amount of complexity we're willing to accept in
the core code.

Note that I have not said "you can't put Boyer-Moore into core".
What I've said is that the case to justify doing that hasn't been made.
And handwaving about "design philosophy" isn't the kind of case I'm
looking for --- common applications in which it makes a real performance
difference are what I'm looking for.

At this point we haven't even been shown any evidence that text_position
itself is what to optimize if you need to do searches in large text
strings.  It seems entirely likely to me that the TOAST mechanisms would
be the bottleneck, instead.  And one should also consider other approaches
entirely, like indexes (tsearch2 anyone?).
        regards, tom lane


Re: text_position worst case runtime

From
Hannu Krosing
Date:
Ühel kenal päeval, R, 2006-05-19 kell 11:20, kirjutas Jim C. Nasby:
> On Thu, May 18, 2006 at 06:49:38PM -0700, Mark Dilger wrote:
> > > I would think that the worst-case times would be fairly improbable.
> > > I'm disinclined to push something as complicated as Boyer-Moore matching
> > > into this function without considerable evidence that it's a performance
> > > bottleneck for real applications.
> > 
> > A common approach in biological data applications is to store nucleic and amino
> > acid sequences as text in a relational database.  The smaller alphabet sizes and
> > the tendency for redundancy in these sequences increases the likelihood of a
> > performance problem.  I have solved this problem by writing my own data types
> > with their own functions for sequence comparison and alignment, and I used
> > boyer-moore for some of that work.  Whether the same technique should be used
> > for the text and varchar types was unclear to me, hence the question.
> 
> Perhaps it would be best to add a seperate set of functions that use
> boyer-moore, and reference them in appropriate places in the
> documentation. Unless someone has a better idea on how we can find out
> what people are actually doing in the field...

I guess our regex implementation already uses boyer-moore or similar.
Why not just expose the match position of substring('text' in 'regex')
using some function, called  match_position(int searched_text, int
regex, int matchnum) ?

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com





Re: text_position worst case runtime

From
Mark Dilger
Date:
Tom Lane wrote:
> Greg Stark <gsstark@mit.edu> writes:
> 
>>Tom Lane <tgl@sss.pgh.pa.us> writes:
>>
>>>And how much code would those take?  The bottom line here is that we
>>>don't have a pile of complaints about the performance of text_position,
>>>so it's difficult to justify making it much more complicated than it
>>>is now.
> 
> 
>>It seems somewhat contrary to the Postgres design philosophy to assume that
>>all strings are small.
> 
> 
> That is a straw-man argument.  If we try to optimize every single
> function in the system to the Nth degree, we'll end up with a system
> that is unmaintainable (and likely unusably buggy as well).  We've got
> to set limits on the amount of complexity we're willing to accept in
> the core code.
> 
> Note that I have not said "you can't put Boyer-Moore into core".
> What I've said is that the case to justify doing that hasn't been made.
> And handwaving about "design philosophy" isn't the kind of case I'm
> looking for --- common applications in which it makes a real performance
> difference are what I'm looking for.
> 
> At this point we haven't even been shown any evidence that text_position
> itself is what to optimize if you need to do searches in large text
> strings.  It seems entirely likely to me that the TOAST mechanisms would
> be the bottleneck, instead.  And one should also consider other approaches
> entirely, like indexes (tsearch2 anyone?).

In case anyone is following this thread specifically for the biological sequence
data aspect of it, I should mention that I wrote a GiST index for the dna and
protein sequence datatypes.  The performance of the index was inconsistent.  For
certain data, I could get about two orders of magnitude speed increase on
selects, where the select was based on a limited regular expression approximate
match against the data.  But if you change the regular expression (or to a
degree, if you change the data) the performance can drop off to roughly tied
with a sequential scan.  And of course, inserts are far more expensive because
the index has to be kept up to date.

If anyone wants specifics, send me an email and I'll put something together.

mark


Re: text_position worst case runtime

From
Tom Lane
Date:
Hannu Krosing <hannu@skype.net> writes:
> I guess our regex implementation already uses boyer-moore or similar.
> Why not just expose the match position of substring('text' in 'regex')
> using some function, called  match_position(int searched_text, int
> regex, int matchnum) ?

If it did that might be a nice solution, but I'm not sure that it does
use B-M ... I can't find either "Boyer" or "Moore" in its source code.

There's no particular reason to suppose offhand that a regex engine
would be faster than the naive code for fixed patterns.
        regards, tom lane


Re: text_position worst case runtime

From
Hannu Krosing
Date:
Ühel kenal päeval, R, 2006-05-19 kell 18:18, kirjutas Tom Lane:
> Hannu Krosing <hannu@skype.net> writes:
> > I guess our regex implementation already uses boyer-moore or similar.
> > Why not just expose the match position of substring('text' in 'regex')
> > using some function, called  match_position(int searched_text, int
> > regex, int matchnum) ?
> 
> If it did that might be a nice solution, but I'm not sure that it does
> use B-M ... I can't find either "Boyer" or "Moore" in its source code.

Ok, maybe it is not optimised for finding longish strings inside even
longers trings.

I had a (false ?) memory that we used some variant of pcre, and that
pcre uses BM. I may be false on both  accounts. (I know that python
borrowed its re module from pcre).

> There's no particular reason to suppose offhand that a regex engine
> would be faster than the naive code for fixed patterns.

if naive code is O(n*m), then starting from some values of n and m it is
probably faster if it is based on somewhat optimised regex engine, the
question is, what is the threasold and dataset for fasterness 

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com




Re: text_position worst case runtime

From
Hannu Krosing
Date:
Ühel kenal päeval, L, 2006-05-20 kell 01:34, kirjutas Hannu Krosing:
> Ühel kenal päeval, R, 2006-05-19 kell 18:18, kirjutas Tom Lane:
> > Hannu Krosing <hannu@skype.net> writes:
> > > I guess our regex implementation already uses boyer-moore or similar.
> > > Why not just expose the match position of substring('text' in 'regex')
> > > using some function, called  match_position(int searched_text, int
> > > regex, int matchnum) ?
> > 
> > If it did that might be a nice solution, but I'm not sure that it does
> > use B-M ... I can't find either "Boyer" or "Moore" in its source code.
> 
> Ok, maybe it is not optimised for finding longish strings inside even
> longers trings.
> 
> I had a (false ?) memory that we used some variant of pcre, and that
> pcre uses BM. I may be false on both  accounts. (I know that python
> borrowed its re module from pcre).

http://www.mcabee.org/lists/snort-users/Mar-05/msg00026.html

seems to imply that PCRE uses BM at least for some case, so I might not
have been wrong in case 2 :)


> > There's no particular reason to suppose offhand that a regex engine
> > would be faster than the naive code for fixed patterns.
> 
> if naive code is O(n*m), then starting from some values of n and m it is
> probably faster if it is based on somewhat optimised regex engine, the
> question is, what is the threasold and dataset for fasterness 
> 
-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com




Re: text_position worst case runtime

From
Alvaro Herrera
Date:
Hannu Krosing wrote:

> I had a (false ?) memory that we used some variant of pcre, and that
> pcre uses BM. I may be false on both  accounts. (I know that python
> borrowed its re module from pcre).

Our code is a derivative from Henry Spencer's code found in Tcl.  It
certainly isn't Boyer Moore, because it processes chars one at a time,
left to right (BM processes right to left).

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: text_position worst case runtime

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> If it did that might be a nice solution, but I'm not sure that it does
> use B-M ... I can't find either "Boyer" or "Moore" in its source code.
> 
> There's no particular reason to suppose offhand that a regex engine
> would be faster than the naive code for fixed patterns.

Well even a lame regexp implementation ought to be O(n+m). The factors will be
less than Boyer-Moore which can skip over substantial sections of the search
space without even looking at the characters. But there's no way it would be
O(n*m) for simple patterns unless the implementation was seriously deficient.

Of course your statement could still be true for particular usage patterns
like searching many different short strings with many different patterns where
the setup time of the regexp tables may dominate.

-- 
greg