Thread: Server-side support of all encodings

Server-side support of all encodings

From
ITAGAKI Takahiro
Date:
Hello,

PostgreSQL suppots SJIS, BIG5, GBK, UHC and GB18030 as client encodings,
but we cannot use them as server encodings. Are there any reason for it?
AFAICS, we can support them only if we add each pg_xxx2wchar_with_len().

I'd like to add server-side SJIS supports for Windows Japanese edition.
Its native encoding is SJIS, so the C library expects SJIS characters are
passed if we set locale='Japanese'. However, we only support EUC_jp and
UTF-8 as valid Japanese encodings.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



Re: Server-side support of all encodings

From
Tom Lane
Date:
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> PostgreSQL suppots SJIS, BIG5, GBK, UHC and GB18030 as client encodings,
> but we cannot use them as server encodings. Are there any reason for it?

Very much so --- they aren't safe ASCII-supersets, and thus for example
the parser will fail on them.  Backend encodings must have the property
that all bytes of a multibyte character are >= 128.
        regards, tom lane


Re: Server-side support of all encodings

From
ITAGAKI Takahiro
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> > PostgreSQL suppots SJIS, BIG5, GBK, UHC and GB18030 as client encodings,
> > but we cannot use them as server encodings. Are there any reason for it?
> 
> Very much so --- they aren't safe ASCII-supersets, and thus for example
> the parser will fail on them.  Backend encodings must have the property
> that all bytes of a multibyte character are >= 128.

But then, PG_JOHAB have already infringed it. Please see johab_to_utf8.map.
Trailing bytes of JOHAB can be less than 128.
It's true that other server-supported encodings use only characters >= 128.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center




Re: Server-side support of all encodings

From
Tom Lane
Date:
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Backend encodings must have the property
>> that all bytes of a multibyte character are >= 128.

> But then, PG_JOHAB have already infringed it. Please see johab_to_utf8.map.
> Trailing bytes of JOHAB can be less than 128.

In that case we must remove JOHAB from the list of allowed server
encodings.  Tatsuo, can you comment on whether this is correct?
        regards, tom lane


Re: Server-side support of all encodings

From
"Ioseph Kim"
Date:
At Korea, Johab code is very old encondig.
by the way, cp949 code page is really used in most environments.

Personally speaking, Johab server code set is not need.
I think that PostgreSQL supports UHC (cp949) server  code set.
This feature will be greet many Korean. :)
Unfortunately, UHC code set have character sequences less then 128 byte.

I tred to patch this problem, but this is not simply. I had gave up. :(


----- Original Message ----- 
From: "ITAGAKI Takahiro" <itagaki.takahiro@oss.ntt.co.jp>
To: "Tom Lane" <tgl@sss.pgh.pa.us>
Cc: <pgsql-hackers@postgresql.org>
Sent: Monday, March 26, 2007 11:29 AM
Subject: Re: [HACKERS] Server-side support of all encodings


>
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> > PostgreSQL suppots SJIS, BIG5, GBK, UHC and GB18030 as client 
>> > encodings,
>> > but we cannot use them as server encodings. Are there any reason for 
>> > it?
>>
>> Very much so --- they aren't safe ASCII-supersets, and thus for example
>> the parser will fail on them.  Backend encodings must have the property
>> that all bytes of a multibyte character are >= 128.
>
> But then, PG_JOHAB have already infringed it. Please see 
> johab_to_utf8.map.
> Trailing bytes of JOHAB can be less than 128.
> It's true that other server-supported encodings use only characters >= 
> 128.
>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: You can help support the PostgreSQL project by donating at
>
>                http://www.postgresql.org/about/donate
> 



Re: Server-side support of all encodings

From
Tatsuo Ishii
Date:
> ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> > Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Backend encodings must have the property
> >> that all bytes of a multibyte character are >= 128.
> 
> > But then, PG_JOHAB have already infringed it. Please see johab_to_utf8.map.
> > Trailing bytes of JOHAB can be less than 128.
> 
> In that case we must remove JOHAB from the list of allowed server
> encodings.  Tatsuo, can you comment on whether this is correct?

Sigh. From the first day when JOHAB was supported (back to 7.3 days),
it should had not been in the server encodings. JOHAB's second byte
definitely contain 0x41 and above. *johab*.map just reflect the
fact. I think we should remove JOHAB from the server encodings list.
I'm afraid users who have JOHAB encoded databases get angry, though.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: Server-side support of all encodings

From
Tatsuo Ishii
Date:
> ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> > Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Backend encodings must have the property
> >> that all bytes of a multibyte character are >= 128.
> 
> > But then, PG_JOHAB have already infringed it. Please see johab_to_utf8.map.
> > Trailing bytes of JOHAB can be less than 128.
> 
> In that case we must remove JOHAB from the list of allowed server
> encodings.  Tatsuo, can you comment on whether this is correct?

Sigh. From the first day when JOHAB was supported (back to 7.3 days),
it should had not been in the server encodings. JOHAB's second byte
definitely contain 0x41 and above. *johab*.map just reflect the
fact. I think we should remove JOHAB from the server encodings list.
I'm afraid users who have JOHAB encoded databases get angry, though.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: Server-side support of all encodings

From
Tom Lane
Date:
Tatsuo Ishii <ishii@sraoss.co.jp> writes:
> Sigh. From the first day when JOHAB was supported (back to 7.3 days),
> it should had not been in the server encodings. JOHAB's second byte
> definitely contain 0x41 and above. *johab*.map just reflect the
> fact. I think we should remove JOHAB from the server encodings list.
> I'm afraid users who have JOHAB encoded databases get angry, though.

I think the best way to proceed is probably to fix this in HEAD but
not back-patch it.  During a dump and reload the encoding can be
corrected to something safe.
        regards, tom lane


Re: Server-side support of all encodings

From
"Dezso Zoltan"
Date:
Hello Everyone,

I very much understand why SJIS is not a server encoding. It contains
ASCII second bytes (including \ and ' both of which can be really
nasty inside a normal sql) and further, half-width katakana is
represented as one byte-characters, incidentally two of which coincide
with a kanji.

My question is, however: what would be the best practice if it was
imperative to use SJIS encoding for texts and no built-in conversions
are useful? To elaborate, I need to support japanese emoji characters,
which are special emoticons for mobile phones. These characters are
usually in a region that is not specified by the standard SJIS,
therefore they are not properly converted either to EUC or UTF8 (which
would be my prefered choice, but unfortunately not all mobile phones
support it, so conversion is still necessary - from what i've seen,
the new SJIS_2004 map seems to define these entities, but I'm not 100%
sure they all get converted properly).

I inherited a system in which this problem is "bypassed" by setting
SQL_ASCII server encoding, but that is not the best solution (full
text search is rendered useless and occasionally the special character
issue rears its ugly head - not only do we have to deal with normal
sqlinjection, but also encoding-based injections) (and for the real
WTF, my predecessor converted everything to EUC before inserting -
eventually losing all the emojis and creating all sorts of strange
phenomena, like tables with one column in euc until a certain date and
sjis from then on while euc for all other columns)

Is there a way to properly deal with sjis+emoji extensions (a patch
i'm not aware of, for example), is it considered as a todo for further
releases or should i consider augmenting postgres in a way (if the
latter, could you provide any pointers on how to proceed?)

Thank you,
Zaki

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane
Sent: Monday, March 26, 2007 11:20 AM
To: ITAGAKI Takahiro
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Server-side support of all encodings

ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> PostgreSQL suppots SJIS, BIG5, GBK, UHC and GB18030 as client encodings,
> but we cannot use them as server encodings. Are there any reason for it?

Very much so --- they aren't safe ASCII-supersets, and thus for example
the parser will fail on them.  Backend encodings must have the property
that all bytes of a multibyte character are >= 128.
        regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to      choose an index scan if your joining column's
datatypesdo not      match
 


Re: Server-side support of all encodings

From
Martijn van Oosterhout
Date:
On Wed, Mar 28, 2007 at 10:44:00AM +0900, Dezso Zoltan wrote:
> My question is, however: what would be the best practice if it was
> imperative to use SJIS encoding for texts and no built-in conversions
> are useful? To elaborate, I need to support japanese emoji characters,
> which are special emoticons for mobile phones. These characters are
> usually in a region that is not specified by the standard SJIS,

Unicode also defines a region for user-defined characters. While it may
be odd, perhaps it would be reasonable to map the user-defined space in
SJIS to somewhere in the user-defined pace in Unicode, so that at least
you round-trip consistancy.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: Server-side support of all encodings

From
Tatsuo Ishii
Date:
> Hello Everyone,
> 
> I very much understand why SJIS is not a server encoding. It contains
> ASCII second bytes (including \ and ' both of which can be really
> nasty inside a normal sql) and further, half-width katakana is
> represented as one byte-characters, incidentally two of which coincide
> with a kanji.
> 
> My question is, however: what would be the best practice if it was
> imperative to use SJIS encoding for texts and no built-in conversions
> are useful? To elaborate, I need to support japanese emoji characters,
> which are special emoticons for mobile phones. These characters are
> usually in a region that is not specified by the standard SJIS,
> therefore they are not properly converted either to EUC or UTF8 (which
> would be my prefered choice, but unfortunately not all mobile phones
> support it, so conversion is still necessary - from what i've seen,
> the new SJIS_2004 map seems to define these entities, but I'm not 100%
> sure they all get converted properly).
> 
> I inherited a system in which this problem is "bypassed" by setting
> SQL_ASCII server encoding, but that is not the best solution (full
> text search is rendered useless and occasionally the special character
> issue rears its ugly head - not only do we have to deal with normal
> sqlinjection, but also encoding-based injections) (and for the real
> WTF, my predecessor converted everything to EUC before inserting -
> eventually losing all the emojis and creating all sorts of strange
> phenomena, like tables with one column in euc until a certain date and
> sjis from then on while euc for all other columns)
> 
> Is there a way to properly deal with sjis+emoji extensions (a patch
> i'm not aware of, for example), is it considered as a todo for further
> releases or should i consider augmenting postgres in a way (if the
> latter, could you provide any pointers on how to proceed?)

You can always use CREATE CONVERSION for this kind of purpose.
Create your own conversion map between SJIS <--> EUC or UT-8.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: Server-side support of all encodings

From
Tatsuo Ishii
Date:
> Hello Everyone,
> 
> I very much understand why SJIS is not a server encoding. It contains
> ASCII second bytes (including \ and ' both of which can be really
> nasty inside a normal sql) and further, half-width katakana is
> represented as one byte-characters, incidentally two of which coincide
> with a kanji.
> 
> My question is, however: what would be the best practice if it was
> imperative to use SJIS encoding for texts and no built-in conversions
> are useful? To elaborate, I need to support japanese emoji characters,
> which are special emoticons for mobile phones. These characters are
> usually in a region that is not specified by the standard SJIS,
> therefore they are not properly converted either to EUC or UTF8 (which
> would be my prefered choice, but unfortunately not all mobile phones
> support it, so conversion is still necessary - from what i've seen,
> the new SJIS_2004 map seems to define these entities, but I'm not 100%
> sure they all get converted properly).
> 
> I inherited a system in which this problem is "bypassed" by setting
> SQL_ASCII server encoding, but that is not the best solution (full
> text search is rendered useless and occasionally the special character
> issue rears its ugly head - not only do we have to deal with normal
> sqlinjection, but also encoding-based injections) (and for the real
> WTF, my predecessor converted everything to EUC before inserting -
> eventually losing all the emojis and creating all sorts of strange
> phenomena, like tables with one column in euc until a certain date and
> sjis from then on while euc for all other columns)
> 
> Is there a way to properly deal with sjis+emoji extensions (a patch
> i'm not aware of, for example), is it considered as a todo for further
> releases or should i consider augmenting postgres in a way (if the
> latter, could you provide any pointers on how to proceed?)

You can always use CREATE CONVERSION for this kind of purpose.
Create your own conversion map between SJIS <--> EUC or UT-8.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: Server-side support of all encodings

From
Tatsuo Ishii
Date:
> Tatsuo Ishii <ishii@sraoss.co.jp> writes:
> > Sigh. From the first day when JOHAB was supported (back to 7.3 days),
> > it should had not been in the server encodings. JOHAB's second byte
> > definitely contain 0x41 and above. *johab*.map just reflect the
> > fact. I think we should remove JOHAB from the server encodings list.
> > I'm afraid users who have JOHAB encoded databases get angry, though.
> 
> I think the best way to proceed is probably to fix this in HEAD but
> not back-patch it.  During a dump and reload the encoding can be
> corrected to something safe.

Ok. Shall I go ahead and remove JOHAB in HEAD?
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: Server-side support of all encodings

From
Tom Lane
Date:
Tatsuo Ishii <ishii@postgresql.org> writes:
>> I think the best way to proceed is probably to fix this in HEAD but
>> not back-patch it.  During a dump and reload the encoding can be
>> corrected to something safe.

> Ok. Shall I go ahead and remove JOHAB in HEAD?

+1 for me.
        regards, tom lane


Re: Server-side support of all encodings

From
Tatsuo Ishii
Date:
> Tatsuo Ishii <ishii@postgresql.org> writes:
> >> I think the best way to proceed is probably to fix this in HEAD but
> >> not back-patch it.  During a dump and reload the encoding can be
> >> corrected to something safe.
> 
> > Ok. Shall I go ahead and remove JOHAB in HEAD?
> 
> +1 for me.
> 
>             regards, tom lane

Done.

BTW, do we have to modify pg_dump or pg_restore so that it can
automatically adjust JOHAB to UTF8 (it's the only safe encoding
compatible with JOHAB)? I'm not sure it's worth the trouble. Maybe
documenting in the release note is enough? I guess that there is 0
users who are using JOHAB.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


Re: Server-side support of all encodings

From
Tom Lane
Date:
Tatsuo Ishii <ishii@postgresql.org> writes:
> BTW, do we have to modify pg_dump or pg_restore so that it can
> automatically adjust JOHAB to UTF8 (it's the only safe encoding
> compatible with JOHAB)? I'm not sure it's worth the trouble. Maybe
> documenting in the release note is enough?

Do we actually need to do anything?  Dumps taken in client_encoding
JOHAB could exist regardless of the source server_encoding --- the
same is true of other client-only encodings.  Such dumps should load
fine into a UTF8 server_encoding database, as long as we have the right
conversion available.

I can imagine someone wanting to take a dump in a client-only encoding
for other reasons (export of the data to somewhere else, say) so I don't
think pg_dump should try to prevent it.
        regards, tom lane


Re: Server-side support of all encodings

From
"William ZHANG"
Date:
"Tom Lane" <tgl@sss.pgh.pa.us>
> ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
>> PostgreSQL suppots SJIS, BIG5, GBK, UHC and GB18030 as client encodings,
>> but we cannot use them as server encodings. Are there any reason for it?
>
> Very much so --- they aren't safe ASCII-supersets, and thus for example
> the parser will fail on them.  Backend encodings must have the property
> that all bytes of a multibyte character are >= 128.

Sorry. I still cannot understand why backend encodings must have this
property. AFAIK, the parser treats characters as ASCII. So any multi-byte
characters will be treated as two or more ASCII characters. But if
the multi-byte encoding doesnot use any special ASCII characters like
single quote('), double quote(") and backslash(\), I think the parser
can deal with it correctly. A quick search in
src\backend\utils\mb\Unicode\*.map tells me that no encoding uses
single quote or double quote, but JOHAB, GBK, GB18030, BIG5, SJIS
use backslash. Since pgsql doesnot accept backslash as escape character
in identity(double quoted string) or value(single quoted string)
any more, I think the parser/scanner can process multi-bytes characters
correctly.

Thanks in advance.
William ZHANG

>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>       choose an index scan if your joining column's datatypes do not
>       match
> 




Re: Server-side support of all encodings

From
Tom Lane
Date:
"William ZHANG" <zedware@gmail.com> writes:
> Sorry. I still cannot understand why backend encodings must have this
> property. AFAIK, the parser treats characters as ASCII. So any multi-byte
> characters will be treated as two or more ASCII characters. But if
> the multi-byte encoding doesnot use any special ASCII characters like
> single quote('), double quote(") and backslash(\), I think the parser
> can deal with it correctly.

You've got your attention too narrowly focused on strings inside quotes;
it's strings outside quotes that are the problem.

As an example, I see that gb18030 defines characters like 97 7e.
If someone tried to use that as a character of a SQL identifier
--- something that'd work fine for the UTF8 equivalent e6 a2 a1
--- the parser would see it as an identifier byte followed by
the operator ~.

Similarly, there are problems if we were to allow these character sets
for the pattern argument of a regular expression operator, or for any
datatype at all that can be embedded in an array constant.  And for PL
languages that feed prosrc strings into external interpreters, such as
Perl or R, it gets really interesting really quickly :-(.

It is possible that some of these encodings could be allowed without
any risks, but I don't think it is worth our time to grovel through
each valid character and every possible backend situation to determine
safety.  The risks are not always obvious --- see for instance the
security holes we fixed about a year ago in 8.1.4 et al --- and so
I for one would never have a lot of faith in there not being any holes.
The rule "no ASCII-aliasing characters" is a simple one that we can have
some confidence in.
        regards, tom lane