Re: Patch: regexp_matches variant returning an array of matching positions - Mailing list pgsql-hackers

From David Johnston
Subject Re: Patch: regexp_matches variant returning an array of matching positions
Date
Msg-id 1390968987243-5789414.post@n5.nabble.com
In response to Re: Patch: regexp_matches variant returning an array of matching positions  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Responses Re: Re: Patch: regexp_matches variant returning an array of matching positions
Re: Patch: regexp_matches variant returning an array of matching positions
List pgsql-hackers
Alvaro Herrera-9 wrote
> Björn Harrtell wrote:
>> I've written a variant of regexp_matches called regexp_matches_positions
>> which instead of returning matching substrings will return matching
>> positions. I found use of this when processing OCR scanned text and wanted
>> to prioritize matches based on their position.
>
> Interesting.  I didn't read the patch but I wonder if it would be of
> more general applicability to return more info in one fell swoop: a
> function returning a set (position, length, text of match), rather than
> an array.  So instead of first calling one function to get the matches and
> then another to get their positions, do it all in one pass.
>
> (See pg_event_trigger_dropped_objects for a simple example of a function
> that returns in that fashion.  There are several others but AFAIR that's
> the simplest one.)

I'm confused by your thinking. Like regexp_matches, this function returns
"SETOF type[]": integer[] in this case, versus text[] for the matches.  I
could see adding a generic function that returns a SETOF named composite
(match varchar[], position int[], length int[]), along with the
corresponding type.  I can't imagine a situation where you'd want the
position but not the text, so having to evaluate the regexp twice seems
wasteful.  The length is probably a waste, though, since it can readily be
derived from the text and is less often needed.  But if it's pre-calculated
anyway...
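
To make the composite idea concrete, here is a rough sketch in Python rather
than SQL or C. It uses Python's re module as a stand-in for the PostgreSQL
regex engine, so the semantics may differ in corner cases. The names MatchRow
and matches_with_positions are hypothetical, and the 1-based positions are an
assumption; neither comes from the patch.

```python
import re
from typing import Iterator, List, NamedTuple, Optional


class MatchRow(NamedTuple):
    """One row per match; the three lists are parallel, one slot per () group."""
    match: List[Optional[str]]
    position: List[Optional[int]]  # 1-based start in the source text (assumed)
    length: List[Optional[int]]


def matches_with_positions(text: str, pattern: str) -> Iterator[MatchRow]:
    """Yield text, position and length for every match in a single pass."""
    compiled = re.compile(pattern)
    # Like regexp_matches: report the () groups, or the whole match if none.
    groups = range(1, compiled.groups + 1) if compiled.groups else range(0, 1)
    for m in compiled.finditer(text):
        texts: List[Optional[str]] = []
        positions: List[Optional[int]] = []
        lengths: List[Optional[int]] = []
        for g in groups:
            if m.start(g) < 0:
                # The group did not participate in this match.
                texts.append(None)
                positions.append(None)
                lengths.append(None)
            else:
                texts.append(m.group(g))
                positions.append(m.start(g) + 1)
                lengths.append(m.end(g) - m.start(g))
        yield MatchRow(texts, positions, lengths)


for row in matches_with_positions("abcabc", "(b)(c)"):
    print(row)
```

Note that an empty sub-match still gets a position and a length of zero, which
keeps it distinguishable from a group that did not participate at all.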

My question is: what position is returned in a multiple-match situation? The
supplied test only covers the simple, non-global case.  It needs to
exercise empty sub-matches and global searches.  One theory is that the
first array slot should hold the global position of match zero (i.e., the
entire pattern) within the larger document, while the sub-matches would be
relative offsets within that single match.  This conflicts, though, with the
fact that regexp_matches only returns array elements for () items and never
for the full match, the goal of this function being parallel un-nesting. But
since nesting is allowed, a () group can still cover the entire match.

How does this resolve in the patch?

SELECT regexp_matches('abcabc','((a)(b)(c))','g');
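
For comparison, here is a small Python sketch of the two candidate
conventions applied to the same input. Again Python's re module stands in for
the PostgreSQL engine, group_positions is a hypothetical helper, and 1-based
numbering is assumed; it does not show what the patch actually does.

```python
import re


def group_positions(text, pattern, relative=False):
    """Return one list of 1-based group positions per match.

    relative=False: positions are counted from the start of the whole text.
    relative=True:  positions are counted from the start of the whole match.
    """
    rows = []
    for m in re.finditer(pattern, text):
        base = m.start(0) if relative else 0
        row = []
        for g in range(1, m.re.groups + 1):
            start = m.start(g)
            # start is -1 when the group did not participate in the match.
            row.append(None if start < 0 else start - base + 1)
        rows.append(row)
    return rows


print(group_positions("abcabc", "((a)(b)(c))"))
# [[1, 1, 2, 3], [4, 4, 5, 6]]
print(group_positions("abcabc", "((a)(b)(c))", relative=True))
# [[1, 1, 2, 3], [1, 1, 2, 3]]
```

With relative offsets the two rows are indistinguishable, so the global
position of the match would have to be reported somewhere else.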

David J.









