"Joel Jacobson" <joel@compiler.org> writes:
> Let's see if I can explain the idea:
> One of the problems with representing regexes with large bracket range expressions, like [a-z],
> is you get an explosion of edges, if the model can only represent state transitions for single characters.
> If we could instead let a single edge (for a state transition) represent a set of characters,
> or normally even more efficiently, a set of range of characters, then we could reduce the
> number of edges we need to represent the graph.
> The naive approach to just use the ranges as-is doesn't work.
> Instead, the graph must first be created with single-character edges.
> It is then examined what ranges can be constructed in a way that no single range
> overlaps any other range, so that every range can be seen as a character in an alphabet.
Hmm ... I might be misunderstanding, but I think our engine already
does a version of this. See the discussion of "colors" in
src/backend/regex/README.
> Another optimization I've come up with (or probably re-invented because it feels quite obvious),
> is to read more than one character, when knowing for sure multiple characters-in-a-row
> are expected, by concatenating edges having only one parent and one child.
Maybe. In practice the actual scanning tends to be tracking more than one
possible NFA state in parallel, so I'm not sure how often we could expect
to be able to use this idea. That is, even if we know that state X can
only succeed by following an arc to Y and then another to Z, we might
also be interested in what happens if the NFA is in state Q at this point;
and it seems unlikely that Q would have exactly the same two following
arc colors.
I do have some ideas about possible future optimizations, and one reason
I'm grateful for this large set of real regexes is that it can provide a
concrete basis for deciding that particular optimizations are or are not
worth pursuing. So thanks again for collecting it!
regards, tom lane