Thread: large document multiple regex

large document multiple regex

From
"Merlin Moncure"
Date:
Hello,

I am receiving a large (300k+) document from an external agent and
need to pull a few interesting bits of data out of the document on
an insert trigger into separate fields.

Regex seems one way to handle this, but is there any way to avoid
rescanning the document for each regex?  One solution I am kicking
around is some C hackery, but then I lose the expressive power of
regex.  Ideally, I need to be able to scan some text and return a
comma-delimited string of values extracted from it.  Does anybody know
if this is possible, or have any other suggestions?

merlin

Re: large document multiple regex

From
Jim Nasby
Date:
On Jan 26, 2007, at 9:06 AM, Merlin Moncure wrote:
> I am receiving a large (300k+) document from an external agent and
> need to pull a few interesting bits of data out of the document on
> an insert trigger into separate fields.
>
> Regex seems one way to handle this, but is there any way to avoid
> rescanning the document for each regex?  One solution I am kicking
> around is some C hackery, but then I lose the expressive power of
> regex.  Ideally, I need to be able to scan some text and return a
> comma-delimited string of values extracted from it.  Does anybody know
> if this is possible, or have any other suggestions?

Have you thought about something like
~ '(first_string|second_string|third_string)'? Obviously your example
would be more complex, but I believe that with careful crafting, you
can get regex to do a lot without resorting to multiple passes.
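
To make the single-pass idea concrete, here is a minimal sketch of
alternation with one named group per field. It uses Python's re module
rather than PostgreSQL's regex engine, and the field labels (Invoice,
Total, Customer) and the sample document are made up for illustration.

```python
import re

# One alternation, one named group per field. The labels and value
# formats are hypothetical; adapt them to the real document.
PATTERN = re.compile(
    r"Invoice:\s*(?P<invoice>\d+)"
    r"|Total:\s*(?P<total>\d+\.\d{2})"
    r"|Customer:\s*(?P<customer>\w+)"
)

def extract_fields(document):
    """Scan the document once, keeping the first value seen per field."""
    found = {}
    for match in PATTERN.finditer(document):
        name = match.lastgroup  # name of the alternative that matched
        found.setdefault(name, match.group(name))
    return found

doc = (
    "header\nCustomer: acme\nlots of filler\n"
    "Invoice: 1042\nmore filler\nTotal: 99.50\n"
)
fields = extract_fields(doc)
print(fields)  # {'customer': 'acme', 'invoice': '1042', 'total': '99.50'}
```

The fields can appear in any order, and the scanner moves through the
text once instead of once per pattern.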
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)



Re: large document multiple regex

From
"Merlin Moncure"
Date:
On 2/1/07, Jim Nasby <decibel@decibel.org> wrote:
> Have you thought about something like
> ~ '(first_string|second_string|third_string)'? Obviously your example
> would be more complex, but I believe that with careful crafting, you
> can get regex to do a lot without resorting to multiple passes.

That doesn't work. I researched the problem further and found that
PostgreSQL's regex implementation has a built-in limitation: it quits
scanning after the first matched group (this is noted in the
documentation).  There is no way that I can see to extract two or more
non-contiguous text chunks in a single regex.

To do it properly, you need to have the sophistication of Perl regex
with its magic variables.
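
For comparison, here is a rough sketch of what an engine that exposes
every capture group allows: one pattern, several non-contiguous chunks,
joined into a comma-delimited string. This is Python's re module, not
PostgreSQL, and the field labels and sample text are assumptions made
up for the example. It also assumes the fields always appear in the
same order.

```python
import re

# One pattern with three capture groups separated by lazy gaps.
# re.DOTALL lets "." cross newlines so the gaps can span lines.
CHUNKS = re.compile(
    r"Customer:\s*(\w+).*?Invoice:\s*(\d+).*?Total:\s*(\d+\.\d{2})",
    re.DOTALL,
)

def extract_csv(document):
    """Return the captured chunks as a comma-delimited string, or None."""
    match = CHUNKS.search(document)
    if match is None:
        return None
    return ",".join(match.groups())

doc = (
    "header\nCustomer: acme\nlots of filler\n"
    "Invoice: 1042\nmore filler\nTotal: 99.50\n"
)
print(extract_csv(doc))  # acme,1042,99.50
```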

merlin

Re: large document multiple regex

From
David Fetter
Date:
On Fri, Feb 02, 2007 at 12:00:27PM -0500, Merlin Moncure wrote:
> On 2/1/07, Jim Nasby <decibel@decibel.org> wrote:
> >Have you thought about something like ~
> >'(first_string|second_string|third_string)'? Obviously your
> >example would be more complex, but I believe that with careful
> >crafting, you can get regex to do a lot without resorting to
> >multiple passes.
>
> That doesn't work. I researched the problem further and found that
> PostgreSQL's regex implementation has a built-in limitation: it quits
> scanning after the first matched group (this is noted in the
> documentation).  There is no way that I can see to extract two or
> more non-contiguous text chunks in a single regex.
>
> To do it properly, you need to have the sophistication of Perl regex
> with its magic variables.

It looks like that's coming in 8.3 :)
<http://archives.postgresql.org/pgsql-hackers/2007-02/msg00039.php>

Cheers,
D
--
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778        AIM: dfetter666
                              Skype: davidfetter

Remember to vote!