Thread: Lossy character conversion to Latin-1
I have a client that only supports Latin-1 and needs to connect to a UTF-8 database to retrieve some data. Some columns may contain characters that have no Latin-1 equivalent. I would like to convert these to a blank or perhaps some hex value. Is there any way to do this in PostgreSQL without using anything other than built in functions or pl/pgsql? It would be nice if the built in convert function had an option to handle this rather than only generating an error. Any pointers to an existing pl/pgsql function to perform this conversion? Thanks, John DeSoi, Ph.D. http://pgedit.com/ Power Tools for PostgreSQL
John DeSoi wrote: > I have a client that only supports Latin-1 and needs to connect to a > UTF-8 database to retrieve some data. Some columns may contain > characters that have no Latin-1 equivalent. I would like to convert > these to a blank or perhaps some hex value. Is there any way to do this > in PostgreSQL without using anything other than built in functions or > pl/pgsql? It would be nice if the built in convert function had an > option to handle this rather than only generating an error. Any pointers > to an existing pl/pgsql function to perform this conversion? You should be able to do this with regular expressions (substituting all invalid chars) on the text columns of the result set(s). I write "should be", since I don't remember the capabilities of pg regular expressions. /Nis
On May 31, 2006, at 5:17 AM, Nis Jorgensen wrote: > You should be able to do this with regular expressions > (substituting all > invalid chars) on the text columns of the result set(s). I write > "should be", since I don't remember the capabilities of pg regular > expressions. I thought about this, but I don't see how to make it work to transform every character. The substring function accepts POSIX regular expressions, but the replace function does not. Maybe I can put something together using substring and position. Thanks, John DeSoi, Ph.D. http://pgedit.com/ Power Tools for PostgreSQL
> I have a client that only supports Latin-1 and needs to connect to a > UTF-8 database to retrieve some data. Some columns may contain > characters that have no Latin-1 equivalent. I would like to convert > these to a blank or perhaps some hex value. Is there any way to do > this in PostgreSQL without using anything other than built in > functions or pl/pgsql? It would be nice if the built in convert > function had an option to handle this rather than only generating an > error. Any pointers to an existing pl/pgsql function to perform this > conversion? It should be easy to write user defined funtion and define your own CONVERSION. CREATE CONVERSION is your friend. -- Tatsuo Ishii SRA OSS, Inc. Japan
On May 31, 2006, at 9:26 AM, Tatsuo Ishii wrote: > It should be easy to write user defined funtion and define your own > CONVERSION. CREATE CONVERSION is your friend. It looks like CREATE CONVERSION requires a C function to do the conversion. This will be used in a hosted environment -- I won't be able to add any C language functions. Am I misunderstanding the documentation? John DeSoi, Ph.D. http://pgedit.com/ Power Tools for PostgreSQL
On May 31, 2006, at 9:17 AM, Nis Jorgensen wrote: > regexp_replace seems to do what you need: > > http://www.postgresql.org/docs/8.1/static/functions-matching.html > > Something like > > regexp_replace (field, '[^\u0000-\u00FF]', '?', 'g') Yes! Thanks very much -- I looked at that page several times and missed regexp_replace. Here is an example: === psql 3 === select regexp_replace('©«¡®£§¼½¾¿ ÀÉÌÓÙÝ àéìóùý āŹźŻżŽž ∧∨ wxyz', '[^\\u0000-\\u00FF]', '?', 'g'); regexp_replace ------------------------------------------ ©«¡®£§¼½¾¿ ÀÉÌÓÙÝ àéìóùý ??????? ?? wxyz (1 row) John DeSoi, Ph.D. http://pgedit.com/ Power Tools for PostgreSQL
On May 31, 2006, at 10:40 AM, John DeSoi wrote: > Yes! Thanks very much -- I looked at that page several times and > missed regexp_replace. Ok, now I know why I missed it. regexp_replace is only in PostgreSQL 8.1 and later. I'm stuck with 8.0 for hosting at the moment. I'm sure it is not very efficient, but the plpgsql function below does the same job for PostgreSQL versions prior to 8.1. Thanks for the help, John DeSoi, Ph.D. http://pgedit.com/ Power Tools for PostgreSQL create or replace function lossy_latin(p_text text) returns text as $$ declare t text; res text := ''; ch text := substring(p_text from '[^\\u0000-\\u00FF]'); pos integer; begin if ch is null then return p_text; else t := p_text; loop pos := strpos(t, ch); res := res || substr(t, 1, pos - 1) || '?'; t := substr(t, pos + 1); ch := substring(t from '[^\\u0000-\\u00FF]'); if ch is null then res := res || t; exit; end if; end loop; return res; end if; end; $$ language plpgsql immutable;
> > It should be easy to write user defined funtion and define your own > > CONVERSION. CREATE CONVERSION is your friend. > > It looks like CREATE CONVERSION requires a C function to do the > conversion. This will be used in a hosted environment -- I won't be > able to add any C language functions. Am I misunderstanding the > documentation? You cannot write your own C function? Too bad. Maybe now is the time to change the hosted environment... -- Tatsuo Ishii SRA OSS, Inc. Japan
John DeSoi wrote: > > On May 31, 2006, at 10:40 AM, John DeSoi wrote: > >> Yes! Thanks very much -- I looked at that page several times and >> missed regexp_replace. > > > Ok, now I know why I missed it. regexp_replace is only in PostgreSQL 8.1 > and later. I'm stuck with 8.0 for hosting at the moment. > > I'm sure it is not very efficient, but the plpgsql function below does > the same job for PostgreSQL versions prior to 8.1. > Command Prompt supports 8.1 for PostgreSQL if you need a new environment. Sincerely, Joshua D. Drake > Thanks for the help, > > > John DeSoi, Ph.D. > http://pgedit.com/ > Power Tools for PostgreSQL > > > > create or replace function lossy_latin(p_text text) > returns text as $$ > declare > t text; > res text := ''; > ch text := substring(p_text from '[^\\u0000-\\u00FF]'); > pos integer; > begin > if ch is null then > return p_text; > else > t := p_text; > loop > pos := strpos(t, ch); > res := res || substr(t, 1, pos - 1) || '?'; > t := substr(t, pos + 1); > ch := substring(t from '[^\\u0000-\\u00FF]'); > if ch is null then > res := res || t; > exit; > end if; > end loop; > return res; > end if; > end; > $$ language plpgsql immutable; > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Have you searched our list archives? > > http://archives.postgresql.org > -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/