On 02.12.2012 20:19, Tom Lane wrote:
> Alexander Korotkov<aekorotkov@gmail.com> writes:
>> Nice idea to delay expanding colors to characters! Obviously, we should
>> delay expanding inly alphanumerical characters. Because non-alphanumberical
>> characters influence graph structure. Trying to implement...
>
> Uh, why would that be? Colors are colors. The regexp machinery doesn't
> care whether they represent alphanumerics or not. (Or to be more
> precise, if there is a situation where it makes a difference, separate
> colors will have been created for each set of characters that need to be
> distinguished.)
The regexp machinery doesn't care, but the trigrams that pg_trgm
extracts only contain alphanumeric characters. So if by looking at the
CNFA graph produced by the regexp machinery you conclude that any
matching strings must contain three-letter sequences "%oo" and "#oo",
you can just club them together into " oo" trigram.
I think you can run a pre-processing step to the colors, and merge
colors that are equivalent as far as trigrams are considered. For
example, if you have a color that contains only character '%', and
another that contains character '#', you can treat them as the same
hcolor. You might then be able to simplify the CNFA. Actually, it would
be even better if you could apply the pre-processing to the regexp
before the regexp machinery turns it into a CNFA. Not sure how easy it
would be to do such pre-processing.
BTW, why create the path matrix? You could check the "check" array of
trigrams in the consistent function directly against the graph.
Consistent should return true, iff there is a path through the graph
following only arcs that contain trigrams present in the check array.
Finding a path through a complex graph could be expensive, O(|E|), but
if the path is complex, the path matrix would be large as well, and
checking against a large matrix isn't exactly free either. It would
allow you to avoid "overflows" caused by having too many paths through
the graph.
- Heikki