Thread: Question on inserting non-ascii strings

Question on inserting non-ascii strings

From
Steven Lembark
Date:
Using Postgres 8.3 with DBI 1.607, DBD::Pg 2.12.0, perl v5.10.0,

I am trying to insert the medline database
contents, which include non-ascii char's,
hopefully using a prepared query.

Playing with the locale, encoding,
client_encoding, standard_conforming_strings,
leaves me able to insert values using one-off
querys, but I need to find out the correct way
to handle these using prepared querys (currently
using C, SQL_ASCII, SQL_ASCII, off).

For example, given:

    create table foo ( bar varchar(255) );

and the author's name

    "P\x8FAZEK"

Setting the client_encoding to "SQL_ASCII"
does not help: the values can be inserted
via

    E'P\x8FAZEK'

but E'$1' will simply insert the "$1" literal
into the table.

Using

    convert_from( $1, 'SQL_ASCII' ) gets

gets the values input, but with a warning.

Q: Is there any combination of locale, encoding,
   client_encoding or functions that will allow
   me to insert values with these escape sequences
   without getting the warnings?

Trying this in psql with various combinations of
prepares statements leaves me unable to use
convert_from with a varchar argument (requires
bytea).

If there is an example in the doc's I'd appreciate
a link to it.

thanks

--
Steven Lembark                                            85-09 90th St.
Workhorse Computing                                 Woodhaven, NY, 11421
lembark@wrkhors.com                                      +1 888 359 3508

Re: Question on inserting non-ascii strings

From
Sam Mason
Date:
On Thu, May 14, 2009 at 11:25:33AM -0400, Steven Lembark wrote:
> Q: Is there any combination of locale, encoding,
>    client_encoding or functions that will allow
>    me to insert values with these escape sequences
>    without getting the warnings?
>
> Trying this in psql with various combinations of
> prepares statements leaves me unable to use
> convert_from with a varchar argument (requires
> bytea).

I think you're confused about where the escaping happens; in SQL,
escape sequences are expanded as it is being parsed (actually, it's the
step before known as "lex"ing, but the two stages are normally rolled
together unless precision is needed).  Parameters are passed after the
SQL has been parsed and hence no expansion of escape sequences in your
values is done.

You want to be using whatever language you're generating the parameter
from (Perl) to handle the expansion of escape sequences for you.  This
will cause the expanded string (i.e. the escapes have been interpreted)
to be sent to Postgres and everything should just work.  Unfortunately I
don't use Perl much, so can't give much in the way of a demo--hopefully
others will.

--
  Sam  http://samason.me.uk/

Re: Question on inserting non-ascii strings

From
Steven Lembark
Date:
On Thu, 14 May 2009 18:44:57 +0100
Sam Mason <sam@samason.me.uk> wrote:

> You want to be using whatever language you're generating the parameter
> from (Perl) to handle the expansion of escape sequences for you.  This
> will cause the expanded string (i.e. the escapes have been interpreted)
> to be sent to Postgres and everything should just work.  Unfortunately I
> don't use Perl much, so can't give much in the way of a demo--hopefully
> others will.

That is what I thought should happen, but
using a database with encoding of UTF8 and
client_encoding of UTF8 still gave me the
warnings.

thanx



--
Steven Lembark                                            85-09 90th St.
Workhorse Computing                                 Woodhaven, NY, 11421
lembark@wrkhors.com                                      +1 888 359 3508

Re: Question on inserting non-ascii strings

From
Sam Mason
Date:
On Thu, May 14, 2009 at 01:57:04PM -0400, Steven Lembark wrote:
> On Thu, 14 May 2009 18:44:57 +0100 Sam Mason <sam@samason.me.uk> wrote:
> > You want to be using whatever language you're generating the parameter
> > from (Perl) to handle the expansion of escape sequences for you.  This
> > will cause the expanded string (i.e. the escapes have been interpreted)
> > to be sent to Postgres and everything should just work.  Unfortunately I
> > don't use Perl much, so can't give much in the way of a demo--hopefully
> > others will.
>
> That is what I thought should happen, but
> using a database with encoding of UTF8 and
> client_encoding of UTF8 still gave me the
> warnings.

Which "warnings" are you talking about?  I thought you said you had
plain SQL working OK, but were struggling to pass parameters containing
UTF-8 encoded characters.

Are you sure that your Perl code is passing the string encoded as UTF8?

--
  Sam  http://samason.me.uk/

Re: Question on inserting non-ascii strings

From
Steven Lembark
Date:
> Which "warnings" are you talking about?  I thought you said you had
> plain SQL working OK, but were struggling to pass parameters containing
> UTF-8 encoded characters.
>
> Are you sure that your Perl code is passing the string encoded as UTF8?

Excellent point: Perl will only pass through the
converted UTF8, if I remember to convert it from
unicode!

The earlier version of the database used SQL_ASCII
and took the unicode byte as-is with a nastygram
about trans-ascii byte -- which led me down the
path of E'xxx'. While trying to make that work I'd
accidentally fat-figered out the utf8 conversion
attempting to comment it out.

Character   Unicode Code    Unicode Name                      UTF Encoding

(from http://www.nlm.nih.gov/databases/dtd/medline_character_database.html#notes)

    use utf8;

    x $c = "\x{F8}"

    x utf8::encode $c;

    x $c
    0  'ø'           <-- UTF8 for a slashed o.

    $sth->execute( $c )

    x $d = $dbh->selectall_arrayref( 'select * from foo' );
    0  ARRAY(0x18ef0d0)
       0  ARRAY(0x18cc1e8)
          0  'ø'     <-- bytes in correct order

    x  utf8::decode $d->[0][0]
    DB<106>  x $d
    0  ARRAY(0x18ef0d0)
       0  ARRAY(0x18cc1e8)


I'm still not sure whether using UTF8 or unicode is
the best way going forward, but will probably stick
with UTF8 in case I have to deal with any offball
character sets.

thanx

Re: Question on inserting non-ascii strings

From
Sam Mason
Date:
On Thu, May 14, 2009 at 07:39:41PM -0400, Steven Lembark wrote:
>
> > Which "warnings" are you talking about?  I thought you said you had
> > plain SQL working OK, but were struggling to pass parameters containing
> > UTF-8 encoded characters.
> >
> > Are you sure that your Perl code is passing the string encoded as UTF8?
>
> Excellent point: Perl will only pass through the
> converted UTF8, if I remember to convert it from
> unicode!

Probably a minor point; but Unicode doesn't mean much to me here,
Unicode defines a set of code-points and their mapping to "characters".
These code-points are always encoded in some form whether that be UTF-8,
UTF-16 or something else.  Strings may be handled in Perl as Unicode
strings of unspecified encoding (I don't know) and hence you need to
specify what encoding you want them in if you ask for an array of bytes.

> The earlier version of the database used SQL_ASCII
> and took the unicode byte as-is with a nastygram
> about trans-ascii byte -- which led me down the
> path of E'xxx'. While trying to make that work I'd
> accidentally fat-figered out the utf8 conversion
> attempting to comment it out.

Doh, easy to do.  One things that I'd be tempted to do to make sure your
code is doing the right thing would be to test that the characters are
correct using another client--psql would be a reasonable choice.  That
way you can have some more confidence that if you have to get other code
to talk to the database it's actually doing what you want it to be doing
and it's not just Perl being consistent with itself, as it should be.

I know I've saved myself a few times doing similar things like this.

--
  Sam  http://samason.me.uk/