Re: Some regular-expression performance hacking - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Some regular-expression performance hacking
Date
Msg-id 2872014.1613681087@sss.pgh.pa.us
Whole thread Raw
In response to Re: Some regular-expression performance hacking  ("Joel Jacobson" <joel@compiler.org>)
Responses Re: Some regular-expression performance hacking
List pgsql-hackers
"Joel Jacobson" <joel@compiler.org> writes:
> Let's see if I can explain the idea:
> One of the problems with representing regexes with large bracket range expressions, like [a-z],
> is you get an explosion of edges, if the model can only represent state transitions for single characters.
> If we could instead let a single edge (for a state transition) represent a set of characters,
> or normally even more efficiently, a set of range of characters, then we could reduce the
> number of edges we need to represent the graph.
> The naive approach to just use the ranges as-is doesn't work.
> Instead, the graph must first be created with single-character edges.
> It is then examined what ranges can be constructed in a way that no single range
> overlaps any other range, so that every range can be seen as a character in an alphabet.

Hmm ... I might be misunderstanding, but I think our engine already
does a version of this.  See the discussion of "colors" in
src/backend/regex/README.

> Another optimization I've come up with (or probably re-invented because it feels quite obvious),
> is to read more than one character, when knowing for sure multiple characters-in-a-row
> are expected, by concatenating edges having only one parent and one child.

Maybe.  In practice the actual scanning tends to be tracking more than one
possible NFA state in parallel, so I'm not sure how often we could expect
to be able to use this idea.  That is, even if we know that state X can
only succeed by following an arc to Y and then another to Z, we might
also be interested in what happens if the NFA is in state Q at this point;
and it seems unlikely that Q would have exactly the same two following
arc colors.

I do have some ideas about possible future optimizations, and one reason
I'm grateful for this large set of real regexes is that it can provide a
concrete basis for deciding that particular optimizations are or are not
worth pursuing.  So thanks again for collecting it!

            regards, tom lane



pgsql-hackers by date:

Previous
From: "Joel Jacobson"
Date:
Subject: Re: Some regular-expression performance hacking
Next
From: "Joel Jacobson"
Date:
Subject: Re: Some regular-expression performance hacking