Thread: Server-side support of all encodings
Hello,

PostgreSQL supports SJIS, BIG5, GBK, UHC and GB18030 as client encodings,
but we cannot use them as server encodings. Is there any reason for this?
As far as I can see, we could support them merely by adding a
pg_xxx2wchar_with_len() for each.

I'd like to add server-side SJIS support for the Japanese edition of
Windows. Its native encoding is SJIS, so the C library expects to be
passed SJIS characters when we set locale='Japanese'. However, we only
support EUC_JP and UTF-8 as valid Japanese server encodings.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> PostgreSQL supports SJIS, BIG5, GBK, UHC and GB18030 as client encodings,
> but we cannot use them as server encodings. Is there any reason for this?

Very much so --- they aren't safe ASCII supersets, and thus for example
the parser will fail on them. Backend encodings must have the property
that all bytes of a multibyte character are >= 128.

			regards, tom lane
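Tom's invariant is easy to check in any language that bundles these codecs. A minimal Python sketch (Python is used here only for its built-in codecs; the backend itself is C) contrasting SJIS with EUC_JP:

```python
# The same kanji, encoded two ways.  Shift-JIS trailing bytes overlap the
# ASCII range (here 0x5C, the backslash), while every byte of an EUC-JP
# multibyte character is >= 0x80 -- exactly the property Tom describes.
kanji = '\u8868'  # U+8868 (hyou, "table"), the classic problem character

sjis = kanji.encode('shift_jis')    # two bytes, the second one is 0x5C ('\')
eucjp = kanji.encode('euc_jp')      # two bytes, both >= 0x80

print(sjis.hex(), eucjp.hex())
assert sjis[-1] == 0x5C                  # ASCII-aliasing: unsafe server-side
assert all(b >= 0x80 for b in eucjp)     # no ASCII aliasing: safe server-side
```

Any byte-oriented scanner that walks an SJIS string without tracking multibyte state will misread that trailing 0x5C as a backslash, which is exactly where the parser trouble starts.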
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > PostgreSQL supports SJIS, BIG5, GBK, UHC and GB18030 as client encodings,
> > but we cannot use them as server encodings. Is there any reason for this?
>
> Very much so --- they aren't safe ASCII-supersets, and thus for example
> the parser will fail on them. Backend encodings must have the property
> that all bytes of a multibyte character are >= 128.

But then, PG_JOHAB already violates that rule. Please see johab_to_utf8.map:
trailing bytes of JOHAB can be less than 128. It's true that the other
server-supported encodings use only bytes >= 128 in multibyte characters.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
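The JOHAB observation can be verified the same way, without reading the map files; a short sketch using Python's bundled JOHAB codec:

```python
# Python ships a JOHAB codec, so the claim is easy to double-check:
syllable = '\uac00'  # the first Hangul syllable, "ga"

b = syllable.encode('johab')
print(b.hex())
assert len(b) == 2
assert b[1] < 0x80   # the trailing byte lands in the ASCII range
```

So JOHAB fails the "all bytes >= 128" test on the very first syllable of the Hangul block.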
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Backend encodings must have the property
>> that all bytes of a multibyte character are >= 128.

> But then, PG_JOHAB already violates that rule. Please see johab_to_utf8.map:
> trailing bytes of JOHAB can be less than 128.

In that case we must remove JOHAB from the list of allowed server
encodings. Tatsuo, can you comment on whether this is correct?

			regards, tom lane
In Korea, the Johab encoding is very old; the CP949 code page is what is
actually used in most environments. Personally speaking, a Johab server
encoding is not needed. I do wish PostgreSQL supported the UHC (CP949)
character set as a server encoding --- that feature would be welcomed by
many Koreans. :) Unfortunately, the UHC character set also has multibyte
sequences containing bytes less than 128. I tried to patch this problem,
but it is not simple, and I gave up. :(

----- Original Message -----
From: "ITAGAKI Takahiro" <itagaki.takahiro@oss.ntt.co.jp>
To: "Tom Lane" <tgl@sss.pgh.pa.us>
Cc: <pgsql-hackers@postgresql.org>
Sent: Monday, March 26, 2007 11:29 AM
Subject: Re: [HACKERS] Server-side support of all encodings

> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> > PostgreSQL supports SJIS, BIG5, GBK, UHC and GB18030 as client encodings,
>> > but we cannot use them as server encodings. Is there any reason for this?
>>
>> Very much so --- they aren't safe ASCII-supersets, and thus for example
>> the parser will fail on them. Backend encodings must have the property
>> that all bytes of a multibyte character are >= 128.
>
> But then, PG_JOHAB already violates that rule. Please see johab_to_utf8.map:
> trailing bytes of JOHAB can be less than 128. It's true that the other
> server-supported encodings use only bytes >= 128.
>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: You can help support the PostgreSQL project by donating at
>
>                 http://www.postgresql.org/about/donate
> ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> > Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Backend encodings must have the property
> >> that all bytes of a multibyte character are >= 128.
>
> > But then, PG_JOHAB already violates that rule. Please see johab_to_utf8.map:
> > trailing bytes of JOHAB can be less than 128.
>
> In that case we must remove JOHAB from the list of allowed server
> encodings. Tatsuo, can you comment on whether this is correct?

Sigh. From the first day JOHAB was supported (back in the 7.3 days), it
should never have been among the server encodings. JOHAB's second byte
can definitely be 0x41 and above; the *johab*.map files just reflect that
fact. I think we should remove JOHAB from the server encodings list. I'm
afraid users who have JOHAB-encoded databases will get angry, though.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
Tatsuo Ishii <ishii@sraoss.co.jp> writes:
> Sigh. From the first day JOHAB was supported (back in the 7.3 days), it
> should never have been among the server encodings. JOHAB's second byte
> can definitely be 0x41 and above; the *johab*.map files just reflect that
> fact. I think we should remove JOHAB from the server encodings list. I'm
> afraid users who have JOHAB-encoded databases will get angry, though.

I think the best way to proceed is probably to fix this in HEAD but not
back-patch it. During a dump and reload the encoding can be corrected to
something safe.

			regards, tom lane
Hello Everyone,

I very much understand why SJIS is not a server encoding. It contains
ASCII second bytes (including \ and ', both of which can be really nasty
inside a normal SQL statement), and furthermore, half-width katakana are
represented as one-byte characters, two of which incidentally coincide
with the bytes of a kanji.

My question is, however: what would be the best practice if it were
imperative to use the SJIS encoding for texts and no built-in conversion
is adequate? To elaborate, I need to support Japanese emoji characters,
which are special emoticons for mobile phones. These characters usually
live in a region not specified by standard SJIS, and therefore are not
properly converted to either EUC or UTF-8 (which would be my preferred
choice, but unfortunately not all mobile phones support it, so conversion
is still necessary; from what I've seen, the new SJIS_2004 map seems to
define these entities, but I'm not 100% sure they all get converted
properly).

I inherited a system in which this problem is "bypassed" by setting the
SQL_ASCII server encoding, but that is not the best solution: full-text
search is rendered useless, and occasionally the special-character issue
rears its ugly head --- not only do we have to deal with normal SQL
injection, but also with encoding-based injection. (And for the real WTF,
my predecessor converted everything to EUC before inserting, eventually
losing all the emoji and creating all sorts of strange phenomena, like
tables with one column in EUC until a certain date and SJIS from then on,
while all the other columns stayed in EUC.)

Is there a way to properly deal with the SJIS emoji extensions (a patch
I'm not aware of, for example)? Is it considered a TODO for future
releases, or should I consider augmenting Postgres myself (and if the
latter, could you provide any pointers on how to proceed)?
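The encoding-based injection mentioned above can be made concrete. The sketch below is hypothetical: it models a client that escapes quotes naively at the byte level (which PostgreSQL's own quoting functions do not do), and shows how a stray SJIS lead byte swallows the inserted backslash:

```python
# A client that escapes quotes byte-by-byte, unaware of SJIS structure.
payload = b"\x95' OR 't'='t"            # attacker-controlled bytes
escaped = payload.replace(b"'", b"\\'")  # naive escaping: ' -> \'

# The inserted backslash (0x5C) fuses with the stray lead byte 0x95 into
# the two-byte SJIS character 95 5C, leaving the quote free again.
decoded = escaped.decode('shift_jis')
print(decoded)
assert "'" in decoded   # the quote survives, un-escaped
```

This is why an SQL_ASCII server holding SJIS bytes is exposed to a class of injections that byte-blind escaping cannot stop.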
Thank you,
Zaki

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane
Sent: Monday, March 26, 2007 11:20 AM
To: ITAGAKI Takahiro
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Server-side support of all encodings

ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> PostgreSQL supports SJIS, BIG5, GBK, UHC and GB18030 as client encodings,
> but we cannot use them as server encodings. Is there any reason for this?

Very much so --- they aren't safe ASCII-supersets, and thus for example
the parser will fail on them. Backend encodings must have the property
that all bytes of a multibyte character are >= 128.

			regards, tom lane
On Wed, Mar 28, 2007 at 10:44:00AM +0900, Dezso Zoltan wrote:
> My question is, however: what would be the best practice if it were
> imperative to use the SJIS encoding for texts and no built-in conversion
> is adequate? To elaborate, I need to support Japanese emoji characters,
> which are special emoticons for mobile phones. These characters usually
> live in a region not specified by standard SJIS,

Unicode also defines a region for user-defined characters. While it may
be odd, perhaps it would be reasonable to map the user-defined space in
SJIS to somewhere in the user-defined space in Unicode, so that you at
least get round-trip consistency.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to
> litigate.
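One way to prototype Martijn's idea outside the backend is a custom codec error handler. The sketch below is illustrative only: the 0xF0-0xFC lead-byte range and the PUA layout are assumptions for the example, not any real carrier's emoji mapping:

```python
import codecs

# Decode-side sketch: byte pairs in an assumed emoji region (lead bytes
# 0xF0-0xFC), which the strict shift_jis codec rejects, are mapped into
# the Unicode private use area instead of raising an error.
def pua_fallback(exc):
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    lead = exc.object[exc.start]
    if 0xF0 <= lead <= 0xFC and exc.start + 1 < len(exc.object):
        trail = exc.object[exc.start + 1]
        cp = 0xE000 + (lead - 0xF0) * 0x100 + trail  # assumed PUA layout
        return chr(cp), exc.start + 2                # consume both bytes
    raise exc

codecs.register_error('sjis-emoji-pua', pua_fallback)

raw = b'\xf0\x40\x82\xa0'  # emoji-region pair followed by hiragana "a"
print(raw.decode('shift_jis', errors='sjis-emoji-pua'))
```

As long as the encode side applies the inverse mapping, the PUA code points round-trip, which is the consistency Martijn is after.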
> Hello Everyone,
>
> I very much understand why SJIS is not a server encoding. It contains
> ASCII second bytes (including \ and ', both of which can be really nasty
> inside a normal SQL statement), and furthermore, half-width katakana are
> represented as one-byte characters, two of which incidentally coincide
> with the bytes of a kanji.
>
> My question is, however: what would be the best practice if it were
> imperative to use the SJIS encoding for texts and no built-in conversion
> is adequate? To elaborate, I need to support Japanese emoji characters,
> which are special emoticons for mobile phones. These characters usually
> live in a region not specified by standard SJIS, and therefore are not
> properly converted to either EUC or UTF-8 (which would be my preferred
> choice, but unfortunately not all mobile phones support it, so conversion
> is still necessary; from what I've seen, the new SJIS_2004 map seems to
> define these entities, but I'm not 100% sure they all get converted
> properly).
>
> I inherited a system in which this problem is "bypassed" by setting the
> SQL_ASCII server encoding, but that is not the best solution (full-text
> search is rendered useless, and occasionally the special-character issue
> rears its ugly head --- not only do we have to deal with normal SQL
> injection, but also with encoding-based injection).
>
> Is there a way to properly deal with the SJIS emoji extensions (a patch
> I'm not aware of, for example)? Is it considered a TODO for future
> releases, or should I consider augmenting Postgres myself (and if the
> latter, could you provide any pointers on how to proceed)?

You can always use CREATE CONVERSION for this kind of purpose. Create
your own conversion map between SJIS <--> EUC or UTF-8.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
> Tatsuo Ishii <ishii@sraoss.co.jp> writes:
> > Sigh. From the first day JOHAB was supported (back in the 7.3 days), it
> > should never have been among the server encodings. JOHAB's second byte
> > can definitely be 0x41 and above; the *johab*.map files just reflect
> > that fact. I think we should remove JOHAB from the server encodings
> > list. I'm afraid users who have JOHAB-encoded databases will get angry,
> > though.
>
> I think the best way to proceed is probably to fix this in HEAD but
> not back-patch it. During a dump and reload the encoding can be
> corrected to something safe.

Ok. Shall I go ahead and remove JOHAB in HEAD?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
Tatsuo Ishii <ishii@postgresql.org> writes:
>> I think the best way to proceed is probably to fix this in HEAD but
>> not back-patch it. During a dump and reload the encoding can be
>> corrected to something safe.

> Ok. Shall I go ahead and remove JOHAB in HEAD?

+1 for me.

			regards, tom lane
> Tatsuo Ishii <ishii@postgresql.org> writes:
> >> I think the best way to proceed is probably to fix this in HEAD but
> >> not back-patch it. During a dump and reload the encoding can be
> >> corrected to something safe.
>
> > Ok. Shall I go ahead and remove JOHAB in HEAD?
>
> +1 for me.

Done. BTW, do we have to modify pg_dump or pg_restore so that they can
automatically adjust JOHAB to UTF8 (the only safe encoding compatible
with JOHAB)? I'm not sure it's worth the trouble; maybe documenting this
in the release notes is enough. I would guess there are zero users with
JOHAB-encoded databases.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
Tatsuo Ishii <ishii@postgresql.org> writes:
> BTW, do we have to modify pg_dump or pg_restore so that they can
> automatically adjust JOHAB to UTF8 (the only safe encoding compatible
> with JOHAB)? I'm not sure it's worth the trouble; maybe documenting this
> in the release notes is enough.

Do we actually need to do anything? Dumps taken with client_encoding
JOHAB could exist regardless of the source server_encoding --- the same
is true of the other client-only encodings. Such dumps should load fine
into a database with UTF8 server_encoding, as long as we have the right
conversion available. I can imagine someone wanting to take a dump in a
client-only encoding for other reasons (export of the data to somewhere
else, say), so I don't think pg_dump should try to prevent it.

			regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> > ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes: >> PostgreSQL suppots SJIS, BIG5, GBK, UHC and GB18030 as client encodings, >> but we cannot use them as server encodings. Are there any reason for it? > > Very much so --- they aren't safe ASCII-supersets, and thus for example > the parser will fail on them. Backend encodings must have the property > that all bytes of a multibyte character are >= 128. Sorry. I still cannot understand why backend encodings must have this property. AFAIK, the parser treats characters as ASCII. So any multi-byte characters will be treated as two or more ASCII characters. But if the multi-byte encoding doesnot use any special ASCII characters like single quote('), double quote(") and backslash(\), I think the parser can deal with it correctly. A quick search in src\backend\utils\mb\Unicode\*.map tells me that no encoding uses single quote or double quote, but JOHAB, GBK, GB18030, BIG5, SJIS use backslash. Since pgsql doesnot accept backslash as escape character in identity(double quoted string) or value(single quoted string) any more, I think the parser/scanner can process multi-bytes characters correctly. Thanks in advance. William ZHANG > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 9: In versions below 8.0, the planner will ignore your desire to > choose an index scan if your joining column's datatypes do not > match >
"William ZHANG" <zedware@gmail.com> writes: > Sorry. I still cannot understand why backend encodings must have this > property. AFAIK, the parser treats characters as ASCII. So any multi-byte > characters will be treated as two or more ASCII characters. But if > the multi-byte encoding doesnot use any special ASCII characters like > single quote('), double quote(") and backslash(\), I think the parser > can deal with it correctly. You've got your attention too narrowly focused on strings inside quotes; it's strings outside quotes that are the problem. As an example, I see that gb18030 defines characters like 97 7e. If someone tried to use that as a character of a SQL identifier --- something that'd work fine for the UTF8 equivalent e6 a2 a1 --- the parser would see it as an identifier byte followed by the operator ~. Similarly, there are problems if we were to allow these character sets for the pattern argument of a regular expression operator, or for any datatype at all that can be embedded in an array constant. And for PL languages that feed prosrc strings into external interpreters, such as Perl or R, it gets really interesting really quickly :-(. It is possible that some of these encodings could be allowed without any risks, but I don't think it is worth our time to grovel through each valid character and every possible backend situation to determine safety. The risks are not always obvious --- see for instance the security holes we fixed about a year ago in 8.1.4 et al --- and so I for one would never have a lot of faith in there not being any holes. The rule "no ASCII-aliasing characters" is a simple one that we can have some confidence in. regards, tom lane