Hello experts,
I want to compare integer arrays using methods based on string similarity
(e.g., Levenshtein, trigrams). To do that I hacked together a custom
function that converts each integer array to a string, where each integer is
mapped to a character by CHR(my_array1[i]+64) (so that 1->A, 2->B, etc.).
For large integers (I have integers up to 300,000) this hack probably
creates invalid UTF-8 characters.
Levenshtein (from the fuzzystrmatch module) does not seem to have a problem
with that and works perfectly, since it just compares UTF-8 codes. On the
other hand, when I try the similarity function
array1<->array1, it works in some cases (I think it works for all integers
up to 4096), but for some larger values I get invalid byte sequence for
encoding "UTF8" errors:
Example integer sequence:
"{8527,63586,8526,63585,63584,63583,63582,8525,8760,63820,63821,63822,860,57610,861,57611,862,57612,57613,863,57614,57615,57616,39850,39851,39852,39853,39854,39855,95275,39856,39857,95276,95277,39858,95278,95279,39859,95280,39860,95281,95282,39861,39862,39863,95283,95284,27095,27096,82406,82407,27097,27098,27099,27100,82408,27101,27102,27103,25702,80837,25703,25704,80838,25705,25706,25707,25708,30011,85343,30012,85344,30013,30014,51019,48260,48261,56809,56810,56811,56812,113829,31762,87568,31763,45925,41778,41779,41780,31778,31779,87571}";
Error message:
invalid byte sequence for encoding "UTF8": 0xed 0xb8 0xa9
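A quick sanity check on the bytes in the error message (a minimal Python
sketch, not PostgreSQL code; the helper names are hypothetical and only for
illustration). Decoded by hand, 0xed 0xb8 0xa9 appears to be the code point
U+DE29, which lies in the surrogate range U+D800..U+DFFF that UTF-8 does not
allow. Subtracting the offset of 64 gives 56809, which is one of the values
in the sequence above:

```python
# Bytes reported in the error message.
bad_bytes = bytes([0xED, 0xB8, 0xA9])

OFFSET = 64  # the offset used in CHR(my_array1[i]+64)


def decode_three_byte_sequence(b):
    """Unpack the 3-byte pattern 1110xxxx 10xxxxxx 10xxxxxx by hand."""
    return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)


def is_surrogate(code_point):
    """True for U+D800..U+DFFF, which may not be encoded in UTF-8."""
    return 0xD800 <= code_point <= 0xDFFF


code_point = decode_three_byte_sequence(bad_bytes)
source_integer = code_point - OFFSET

print(hex(code_point), is_surrogate(code_point), source_integer)

# Array values that the +64 mapping would send into the surrogate range.
surrogate_inputs = (0xD800 - OFFSET, 0xDFFF - OFFSET)
print(surrogate_inputs)
```

If this reading is right, any array value between 55232 and 57279 would hit
the same problem under the +64 mapping; in the example sequence that covers
56809 through 56812.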
Is there a way to suppress these errors, similar to levenshtein, which does
not care about the validity of UTF-8 characters?
--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Pg-trgm-and-invalid-invalid-byte-sequence-for-encoding-UTF8-tp5791681.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.