Is there a way to consider white space in tri-grams? That would allow for better matches of phrases.
For example, "one two three" and "three two one" currently generate the same set of tri-grams ({"  o","  t"," on"," th"," tw","ee ","hre","ne ","one","ree","thr","two","wo "}), so "one two four" scores the same against both of them. The query:
SELECT phrase
,input
,similarity(t1.phrase, t2.input)
,word_similarity(t1.phrase, t2.input)
FROM (values('one two three'),('three two one')) t1(phrase)
,(values('one two four')) t2(input);
Returns:
phrase |input |similarity |word_similarity |
--------------|-------------|------------|----------------|
one two three |one two four |0.444444448 |0.615384638 |
three two one |one two four |0.444444448 |0.615384638 |
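To show why the scores come out identical, here is a minimal Python sketch of how I understand the tri-gram extraction: each word is lowercased, padded with two spaces in front and one behind, and cut into three-character windows, and the similarity is the shared tri-grams divided by the distinct tri-grams of both strings. The helper names `trigrams` and `similarity` are my own for illustration, not the database's API, and the real implementation may differ in details such as non-ASCII handling.

```python
import re


def trigrams(text):
    """Approximate tri-gram extraction (an assumption, not the real implementation)."""
    result = set()
    # Words are runs of letters/digits; everything else only separates words.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing space
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


def similarity(a, b):
    """Shared tri-grams divided by all distinct tri-grams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


# Word order is lost because each word is padded and cut up on its own.
print(trigrams("one two three") == trigrams("three two one"))
print(similarity("one two three", "one two four"))
print(similarity("three two one", "one two four"))
```

Because the padding is applied per word, no tri-gram ever spans two words, so the set carries no information about which word follows which.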
But surely "one two four" is more similar to "one two three" than to "three two one".
Any thoughts?