Pg_trgm and "invalid invalid byte sequence for encoding UTF8" - Mailing list pgsql-general

From	alexandros_e
Subject	Pg_trgm and "invalid invalid byte sequence for encoding UTF8"
Date	February 12, 2014 20:21:08
Msg-id	1392236457460-5791681.post@n5.nabble.com
List	pgsql-general

Hello experts,

I want to compare integer arrays using methods based on string similarity
(e.g., Levenshtein, trigrams, etc.). To do that, I hacked together a custom
function that converts each integer array to a string, where each integer is
converted to a character by CHR(my_array1[i]+64) (so that 1->A, 2->B, etc.).
Of course, for large integers (I have integers up to 300,000) this hack
probably creates invalid UTF-8 characters. Levenshtein (from the
fuzzystrmatch module) does not seem to have a problem with that and works
perfectly, since it is based on just comparing UTF-8 codes. On the other
hand, when I try the pg_trgm distance operator, array1 <-> array1, it works
in some cases (I think it works for all integers up to 4096), but for some
larger values I get invalid byte sequence for encoding "UTF8" errors:

Example integer sequence:

"8527,63586,8526,63585,63584,63583,63582,8525,8760,63820,63821,63822,860,57610,861,57611,862,57612,57613,863,57614,57615,57616,39850,39851,39852,39853,39854,39855,95275,39856,39857,95276,95277,39858,95278,95279,39859,95280,39860,95281,95282,39861,39862,39863,95283,95284,27095,27096,82406,82407,27097,27098,27099,27100,82408,27101,27102,27103,25702,80837,25703,25704,80838,25705,25706,25707,25708,30011,85343,30012,85344,30013,30014,51019,48260,48261,56809,56810,56811,56812,113829,31762,87568,31763,45925,41778,41779,41780,31778,31779,87571}";

Error message:

invalid byte sequence for encoding "UTF8": 0xed 0xb8 0xa9
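
For what it is worth, the bytes in the error can be traced back to the
mapping itself. Below is a minimal Python sketch, not PostgreSQL code: it
assumes that CHR(n+64) simply yields the Unicode code point n+64, and the
names to_char, is_surrogate, and sample are made up for illustration. It
checks an excerpt of the sequence above for code points that land in the
UTF-16 surrogate range (U+D800 to U+DFFF), which strict UTF-8 validation
rejects.

```python
# Sketch only: mirrors the CHR(my_array1[i]+64) mapping in Python to see
# which code points it produces. It does not reproduce PostgreSQL itself.

OFFSET = 64  # from the post: 1 -> 'A', 2 -> 'B', ...


def to_char(n: int) -> str:
    """Map an integer to the character with code point n + OFFSET."""
    return chr(n + OFFSET)


def is_surrogate(code_point: int) -> bool:
    """True for U+D800..U+DFFF, which are not valid in strict UTF-8."""
    return 0xD800 <= code_point <= 0xDFFF


# Excerpt of the example sequence (not the full list).
sample = [8527, 63586, 860, 57610, 39850, 95275,
          56809, 56810, 56811, 56812, 113829]

# Integers whose mapped code point falls in the surrogate range.
bad = [n for n in sample if is_surrogate(n + OFFSET)]

# "surrogatepass" makes Python emit the bytes a strict encoder refuses.
encoded = to_char(56809).encode("utf-8", "surrogatepass")

print(bad)
print(encoded.hex())
```

Under these assumptions, 56809 maps to U+DE29, whose three-byte form is
0xed 0xb8 0xa9, the same bytes as in the error message. That suggests the
failures come from the integers that land in the surrogate range rather
than from large values in general.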

Is there a way to suppress these errors, the way levenshtein does, which
does not care about the validity of UTF-8 characters?
