I wrote:
> I found a possible patch that seems to improve matters: AFAICS the DFA
> matching is independent of the checking that cdissect() and friends do,
> so we can apply that check first instead of second. This brings the
> runtime down from minutes to sub-second on my machine. However I don't
> have a whole lot of faith either in the correctness of this change, or
> that it might not make some other cases slower instead of faster.
> Has anyone got a reasonably messy collection of regexps they'd like
> to try this patch on?
The Tcl folk accepted that patch, so I went ahead and applied it to
our code. It would still be a good idea for us to do any testing we
can on it, though.
> Also, we likely still ought to toss some CHECK_FOR_INTERRUPTS calls
> in there ...
I looked at this a bit and decided that it would be messier than seems
justified in the absence of known problems. The regex code uses malloc
not palloc for transient data structures, so we can't just stick a
CHECK_FOR_INTERRUPTS() into some inner loop hotspot --- throwing a
longjmp would mean a permanent memory leak. I looked at integrating
into the error mechanism it does have, but it turns out that that's
not particularly designed to force immediate exit on failure either;
they just set a flag bit that will be tested whenever control does
return from the match. I think that a good fix would require first
changing the mallocs to pallocs, then add CHECK_FOR_INTERRUPTS.
But that's not something I have time to mess with at the moment.
regards, tom lane