Thread: Lossy character conversion to Latin-1

Lossy character conversion to Latin-1

From
John DeSoi
Date:
I have a client that only supports Latin-1 and needs to connect to a
UTF-8 database to retrieve some data. Some columns may contain
characters that have no Latin-1 equivalent. I would like to convert
these to a blank or perhaps some hex value. Is there any way to do
this in PostgreSQL without using anything other than built in
functions or pl/pgsql? It would be nice if the built in convert
function had an option to handle this rather than only generating an
error. Any pointers to an existing pl/pgsql function to perform this
conversion?

Thanks,

John DeSoi, Ph.D.
http://pgedit.com/
Power Tools for PostgreSQL


Re: Lossy character conversion to Latin-1

From
Nis Jorgensen
Date:
John DeSoi wrote:
> I have a client that only supports Latin-1 and needs to connect to a
> UTF-8 database to retrieve some data. Some columns may contain
> characters that have no Latin-1 equivalent. I would like to convert
> these to a blank or perhaps some hex value. Is there any way to do this
> in PostgreSQL without using anything other than built in functions or
> pl/pgsql? It would be nice if the built in convert function had an
> option to handle this rather than only generating an error. Any pointers
> to an existing pl/pgsql function to perform this conversion?

You should be able to do this with regular expressions (substituting all
invalid chars) on the text columns of the result set(s). I write "should
be", since I don't remember the capabilities of pg regular expressions.

/Nis



Re: Lossy character conversion to Latin-1

From
John DeSoi
Date:
On May 31, 2006, at 5:17 AM, Nis Jorgensen wrote:

> You should be able to do this with regular expressions
> (substituting all
> invalid chars) on the text columns of the result set(s). I write
> "should be", since I don't remember the capabilities of pg regular
> expressions.

I thought about this, but I don't see how to make it work to
transform every character. The substring function accepts POSIX
regular expressions, but the replace function does not. Maybe I can
put something together using substring and position.

Thanks,


John DeSoi, Ph.D.
http://pgedit.com/
Power Tools for PostgreSQL


Re: Lossy character conversion to Latin-1

From
Tatsuo Ishii
Date:
> I have a client that only supports Latin-1 and needs to connect to a
> UTF-8 database to retrieve some data. Some columns may contain
> characters that have no Latin-1 equivalent. I would like to convert
> these to a blank or perhaps some hex value. Is there any way to do
> this in PostgreSQL without using anything other than built in
> functions or pl/pgsql? It would be nice if the built in convert
> function had an option to handle this rather than only generating an
> error. Any pointers to an existing pl/pgsql function to perform this
> conversion?

It should be easy to write user defined funtion and define your own
CONVERSION.  CREATE CONVERSION is your friend.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

Re: Lossy character conversion to Latin-1

From
John DeSoi
Date:
On May 31, 2006, at 9:26 AM, Tatsuo Ishii wrote:

> It should be easy to write user defined funtion and define your own
> CONVERSION.  CREATE CONVERSION is your friend.

It looks like CREATE CONVERSION requires a C function to do the
conversion. This will be used in a hosted environment -- I won't be
able to add any C language functions. Am I misunderstanding the
documentation?



John DeSoi, Ph.D.
http://pgedit.com/
Power Tools for PostgreSQL




Re: Lossy character conversion to Latin-1

From
John DeSoi
Date:
On May 31, 2006, at 9:17 AM, Nis Jorgensen wrote:

> regexp_replace seems to do what you need:
>
> http://www.postgresql.org/docs/8.1/static/functions-matching.html
>
> Something like
>
> regexp_replace (field, '[^\u0000-\u00FF]', '?', 'g')

Yes! Thanks very much -- I looked at that page several times and
missed regexp_replace.

Here is an example:

=== psql 3 ===
select regexp_replace('©«¡®£§¼½¾¿ ÀÉÌÓÙÝ àéìóùý
āŹźŻżŽž ∧∨ wxyz', '[^\\u0000-\\u00FF]', '?', 'g');
               regexp_replace
------------------------------------------
©«¡®£§¼½¾¿ ÀÉÌÓÙÝ àéìóùý ??????? ?? wxyz
(1 row)




John DeSoi, Ph.D.
http://pgedit.com/
Power Tools for PostgreSQL


Re: Lossy character conversion to Latin-1

From
John DeSoi
Date:
On May 31, 2006, at 10:40 AM, John DeSoi wrote:

> Yes! Thanks very much -- I looked at that page several times and
> missed regexp_replace.


Ok, now I know why I missed it. regexp_replace is only in PostgreSQL
8.1 and later. I'm stuck with 8.0 for hosting at the moment.

I'm sure it is not very efficient, but the plpgsql function below
does the same job for PostgreSQL versions prior to 8.1.

Thanks for the help,


John DeSoi, Ph.D.
http://pgedit.com/
Power Tools for PostgreSQL



create or replace function lossy_latin(p_text text)
returns text as $$
declare
    t text;
    res text := '';
    ch text := substring(p_text from '[^\\u0000-\\u00FF]');
    pos integer;
begin
    if ch is null then
        return p_text;
    else
        t := p_text;
        loop
            pos := strpos(t, ch);
            res := res || substr(t, 1, pos - 1) || '?';
            t := substr(t, pos + 1);
            ch := substring(t from '[^\\u0000-\\u00FF]');
            if ch is null then
                res := res || t;
                exit;
            end if;
        end loop;
        return res;
    end if;
end;
$$ language plpgsql immutable;

Re: Lossy character conversion to Latin-1

From
Tatsuo Ishii
Date:
> > It should be easy to write user defined funtion and define your own
> > CONVERSION.  CREATE CONVERSION is your friend.
>
> It looks like CREATE CONVERSION requires a C function to do the
> conversion. This will be used in a hosted environment -- I won't be
> able to add any C language functions. Am I misunderstanding the
> documentation?

You cannot write your own C function? Too bad. Maybe now is the time
to change the hosted environment...
--
Tatsuo Ishii
SRA OSS, Inc. Japan

Re: Lossy character conversion to Latin-1

From
"Joshua D. Drake"
Date:
John DeSoi wrote:
>
> On May 31, 2006, at 10:40 AM, John DeSoi wrote:
>
>> Yes! Thanks very much -- I looked at that page several times and
>> missed regexp_replace.
>
>
> Ok, now I know why I missed it. regexp_replace is only in PostgreSQL 8.1
> and later. I'm stuck with 8.0 for hosting at the moment.
>
> I'm sure it is not very efficient, but the plpgsql function below does
> the same job for PostgreSQL versions prior to 8.1.
>

Command Prompt supports 8.1 for PostgreSQL if you need a new environment.

Sincerely,

Joshua D. Drake



> Thanks for the help,
>
>
> John DeSoi, Ph.D.
> http://pgedit.com/
> Power Tools for PostgreSQL
>
>
>
> create or replace function lossy_latin(p_text text)
> returns text as $$
> declare
>     t text;
>     res text := '';
>     ch text := substring(p_text from '[^\\u0000-\\u00FF]');
>     pos integer;
> begin
>     if ch is null then
>         return p_text;
>     else
>         t := p_text;
>         loop
>             pos := strpos(t, ch);
>             res := res || substr(t, 1, pos - 1) || '?';
>             t := substr(t, pos + 1);
>             ch := substring(t from '[^\\u0000-\\u00FF]');
>             if ch is null then
>                 res := res || t;
>                 exit;
>             end if;
>         end loop;
>         return res;
>     end if;
> end;
> $$ language plpgsql immutable;
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>               http://archives.postgresql.org
>


--

    === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
    Providing the most comprehensive  PostgreSQL solutions since 1997
              http://www.commandprompt.com/