Thread: Database object names and libpq in UTF-8 locale on Windows

Database object names and libpq in UTF-8 locale on Windows

From
Sebastien FLAESCH
Date:
Hello,

Using PostgreSQL 9.2.1 on Windows, I am facing a strange character set issue
with a UTF-8 database.

Maybe this is expected, but want to be sure that I am not missing something.

On Windows, I have created a database with:
  ENCODING = 'UTF-8'  LC_COLLATE = 'English_United States.1252'  LC_CTYPE = 'English_United States.1252'

I am using libpq with a simple C program.
The program handles UTF-8 strings.

While it's not a traditional "UNICODE" Windows application using WCHAR, it
should be possible to send and get back UTF-8 data with PostgreSQL.

Interacting with the Windows system in the system locale is another story, but
from a pure C / SQL / libpq point of view, as long as the PostgreSQL client
encoding is properly defined to UTF-8, it should work, and it does mainly:

- I can use UTF-8 string constants in my queries.
- I can pass UTF-8 data to the database with parameterized queries.
- I can fetch UTF-8 data from the database.
- I can create db object names with UTF-8 characters.

But the db object names must be specified with double quotes:
When I do not use quoted db object names, I get a strange problem.
The table is created, I can use it in my program, I can even use it in the
pgAdmin query tool, but in the pgAdmin db browser, there is no table name
displayed in the treeview...

Further, when selecting schema information from pg_class, I can see that the
table exists, but there is nothing displayed in the relname attribute...

It appears that the problem disappears when using a "C" collation are char type:
  ENCODING = 'UTF-8'  LC_COLLATE = 'C'  LC_CTYPE = 'C'

I suspect this has something to do with the fact that non-quoted identifiers
are converted to lowercase, and because my LC_CTYPE is English_United States.1252,
the conversion to lowercase fails...

But:

- why does PostgreSQL accept to create the table with invalid UTF-8 characters?

- why can I make queries in the query tool with the UTF-8 table name?
(note that show client_encoding in the query tool gives me "UNICODE" - is this
the same as "UTF-8"?)


So is there a bug, or do I have to use a C collation and char type for UTF-8
databases on Windows?

I know that UTF-8 is not supported by the Windows C/POSIX library (setlocale,
and co).

Does it mean that it's not realistic to use UTF-8 encoding for PostgreSQL
databases on Windows...?

Thanks for reading.
Sebastien FLAESCH
Four Js Development Tools



Re: Database object names and libpq in UTF-8 locale on Windows

From
Sebastien FLAESCH
Date:
Hi all,
I would appreciate some help or comments on this topic...
Is this the wrong mailing list to ask such question?
Seb

On 10/18/2012 10:15 AM, Sebastien FLAESCH wrote:
> Hello,
>
> Using PostgreSQL 9.2.1 on Windows, I am facing a strange character set
> issue
> with a UTF-8 database.
>
> Maybe this is expected, but want to be sure that I am not missing
> something.
>
> On Windows, I have created a database with:
>
> ENCODING = 'UTF-8'
> LC_COLLATE = 'English_United States.1252'
> LC_CTYPE = 'English_United States.1252'
>
> I am using libpq with a simple C program.
> The program handles UTF-8 strings.
>
> While it's not a traditional "UNICODE" Windows application using WCHAR, it
> should be possible to send and get back UTF-8 data with PostgreSQL.
>
> Interacting with the Windows system in the system locale is another
> story, but
> from a pure C / SQL / libpq point of view, as long as the PostgreSQL client
> encoding is properly defined to UTF-8, it should work, and it does mainly:
>
> - I can use UTF-8 string constants in my queries.
> - I can pass UTF-8 data to the database with parameterized queries.
> - I can fetch UTF-8 data from the database.
> - I can create db object names with UTF-8 characters.
>
> But the db object names must be specified with double quotes:
> When I do not use quoted db object names, I get a strange problem.
> The table is created, I can use it in my program, I can even use it in the
> pgAdmin query tool, but in the pgAdmin db browser, there is no table name
> displayed in the treeview...
>
> Further, when selecting schema information from pg_class, I can see that
> the
> table exists, but there is nothing displayed in the relname attribute...
>
> It appears that the problem disappears when using a "C" collation are
> char type:
>
> ENCODING = 'UTF-8'
> LC_COLLATE = 'C'
> LC_CTYPE = 'C'
>
> I suspect this has something to do with the fact that non-quoted
> identifiers
> are converted to lowercase, and because my LC_CTYPE is English_United
> States.1252,
> the conversion to lowercase fails...
>
> But:
>
> - why does PostgreSQL accept to create the table with invalid UTF-8
> characters?
>
> - why can I make queries in the query tool with the UTF-8 table name?
> (note that show client_encoding in the query tool gives me "UNICODE" -
> is this
> the same as "UTF-8"?)
>
>
> So is there a bug, or do I have to use a C collation and char type for
> UTF-8
> databases on Windows?
>
> I know that UTF-8 is not supported by the Windows C/POSIX library
> (setlocale,
> and co).
>
> Does it mean that it's not realistic to use UTF-8 encoding for PostgreSQL
> databases on Windows...?
>
> Thanks for reading.
> Sebastien FLAESCH
> Four Js Development Tools
>
>




Re: Database object names and libpq in UTF-8 locale on Windows

From
Andrew Dunstan
Date:
On 10/22/2012 12:53 PM, Sebastien FLAESCH wrote:

[Issues with unquoted utf8 identifiers in Windows 1252 locale]

>> I suspect this has something to do with the fact that non-quoted
>> identifiers
>> are converted to lowercase, and because my LC_CTYPE is English_United
>> States.1252,
>> the conversion to lowercase fails...


Quite possibly. The code comment says this:
        /*         * SQL99 specifies Unicode-aware case normalization, which we   don't yet         * have the
infrastructurefor.  Instead we use tolower() to   provide a         * locale-aware translation.  However, there are
somelocales   where this         * is not right either (eg, Turkish may do strange things with   'i' and         *
'I'). Our current compromise is to use tolower() for   characters with         * the high bit set, and use an
ASCII-onlydowncasing for 7-bit         * characters.         */
 

For now your best bet is probably not to use UTF8 non-ascii chars or to 
quote the identifiers.

Given we're calling to_lower() on a single byte in the code referred to, 
should we even be doing that when we have a multi-byte encoding and the 
high bit is set?

Aside: I'd love to fix up our treatment of identifiers, but there is 
probably a LOT of very tedious work involved.

cheers

andrew





Re: Database object names and libpq in UTF-8 locale on Windows

From
Andrew Dunstan
Date:
> Given we're calling to_lower() on a single byte in the code referred
> to, should we even be doing that when we have a multi-byte encoding
> and the high bit is set?
>
>


Nobody responded to this, but I'm rather inclined to say we should not.

Here's a simple patch to avoid this case.

Comments?

cheers

andrew



Attachment

Re: Database object names and libpq in UTF-8 locale on Windows

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> Here's a simple patch to avoid this case.
> Comments?

I'm not sure that's the only place we're doing this ...
        regards, tom lane



Re: Database object names and libpq in UTF-8 locale on Windows

From
Andrew Dunstan
Date:
On 11/21/2012 11:11 AM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> Here's a simple patch to avoid this case.
>> Comments?
> I'm not sure that's the only place we're doing this ...
>
>             

Oh, Hmm, darn. Where else do you think we might?

cheers

andrew




Re: Database object names and libpq in UTF-8 locale on Windows

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 11/21/2012 11:11 AM, Tom Lane wrote:
>> I'm not sure that's the only place we're doing this ...

> Oh, Hmm, darn. Where else do you think we might?

Dunno, but grepping for isupper and/or tolower should find any such
places.
        regards, tom lane



Re: Database object names and libpq in UTF-8 locale on Windows

From
Sebastien FLAESCH
Date:
Tom, Andrew,

We have the same issue in our product: Support UTF-8 on Windows.

You know certainly that UTF-8 code page (65001) is no supported by MS Windows
when you set the locale with setlocale(). You cannot rely on standard libc
functions such as isalpha(), mbtowc(), mbstowc(), wctomb(), wcstombs(),
strcoll(), which depend on the current locale.

You should start to centralize all basic character-set related functions
(upper/lower, comparison, etc) in a library, to ease the port on Windows.

Then convert UTF-8 data to wide char and call wide char functions.

For example, to implement an uppercase() function:

1) Convert UTF-8 to Wide Char (algorithm can be easily found)
2) Use towupper()
3) Convert Wide Char result to UTF-8 (algorithm can be easily found)

To compare characters:

1) Convert s1 in UTF-8 to Wide Char => wcs1
2) Convert s2 in UTF-8 to Wide Char => wcs2
3) Use wcscoll(wcs1, wcs2)

Regards,
Seb

On 11/21/2012 06:07 PM, Tom Lane wrote:
> Andrew Dunstan<andrew@dunslane.net>  writes:
>> On 11/21/2012 11:11 AM, Tom Lane wrote:
>>> I'm not sure that's the only place we're doing this ...
>
>> Oh, Hmm, darn. Where else do you think we might?
>
> Dunno, but grepping for isupper and/or tolower should find any such
> places.
>
>             regards, tom lane
>




Re: Database object names and libpq in UTF-8 locale on Windows

From
Andrew Dunstan
Date:
On 11/21/2012 12:07 PM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> On 11/21/2012 11:11 AM, Tom Lane wrote:
>>> I'm not sure that's the only place we're doing this ...
>> Oh, Hmm, darn. Where else do you think we might?
> Dunno, but grepping for isupper and/or tolower should find any such
> places.
>
>             


I've eliminated some false positives from the grep results. and here's a 
list of list of the remaining files doing things that could be suspect. 
Haven't had time to dig more.
   src/backend/regex/regc_locale.c   src/backend/regex/regcomp.c   src/backend/regex/regc_pg_locale.c
src/backend/tsearch/ts_locale.c  src/backend/utils/adt/datetime.c   src/backend/utils/adt/formatting.c
src/backend/utils/adt/inet_net_pton.c  src/backend/utils/adt/like.c   src/backend/utils/misc/tzparser.c
 


I'd be tempted to say we should fix up the identifier issue regardless 
of the rest, though.

cheers

andrew