Re: Server-side support of all encodings - Mailing list pgsql-hackers

From Dezso Zoltan
Subject Re: Server-side support of all encodings
Date
Msg-id 7568ba740703271844k69050a61g7e0f6da17e5a4240@mail.gmail.com
In response to Server-side support of all encodings  (ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp>)
Responses Re: Server-side support of all encodings  (Martijn van Oosterhout <kleptog@svana.org>)
Re: Server-side support of all encodings  (Tatsuo Ishii <ishii@sraoss.co.jp>)
Re: Server-side support of all encodings  (Tatsuo Ishii <ishii@postgresql.org>)
List pgsql-hackers
Hello Everyone,

I very much understand why SJIS is not a server encoding. It contains
ASCII second bytes (including \ and ', both of which can be really
nasty inside normal SQL) and, further, half-width katakana are
represented as one-byte characters, incidentally two of which coincide
with a kanji.
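
To illustrate the trail-byte problem described above, here is a minimal
sketch using Python's built-in codecs (not PostgreSQL's own conversion
tables, so treat it as an approximation). The helper name
ascii_trail_bytes and the two sample characters are my own choices for
illustration.

```python
# Sketch: find characters whose multibyte encoding contains an ASCII-range byte.
# Uses Python's stdlib codecs, which may differ in detail from PostgreSQL's maps.

def ascii_trail_bytes(text, encoding="shift_jis"):
    """Return (char, encoded bytes) pairs whose multibyte form has a byte < 0x80."""
    hits = []
    for ch in text:
        raw = ch.encode(encoding)
        if len(raw) > 1 and any(b < 0x80 for b in raw):
            hits.append((ch, raw))
    return hits

# Both sample characters end in 0x5C, the ASCII backslash, in Shift_JIS.
for ch, raw in ascii_trail_bytes("ソ表"):
    print(ch, raw.hex(), "contains backslash:", b"\\" in raw)

# The same characters in EUC-JP use only bytes >= 0x80, so nothing is reported.
print(ascii_trail_bytes("ソ表", encoding="euc_jp"))
```

A byte-oriented scanner that treats 0x5C as an escape character would
misread the second byte of either character, which is the kind of
breakage the paragraph above refers to.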

My question is, however: what would be the best practice if it were
imperative to use SJIS encoding for texts and none of the built-in
conversions are useful? To elaborate, I need to support Japanese emoji
characters, which are special emoticons for mobile phones. These
characters are usually in a region that is not specified by standard SJIS,
therefore they are not properly converted to either EUC or UTF8 (which
would be my preferred choice, but unfortunately not all mobile phones
support it, so conversion is still necessary - from what I've seen,
the new SJIS_2004 map seems to define these entities, but I'm not 100%
sure they all get converted properly).

I inherited a system in which this problem is "bypassed" by setting
the server encoding to SQL_ASCII, but that is not the best solution (full
text search is rendered useless and occasionally the special character
issue rears its ugly head - not only do we have to deal with normal
SQL injection, but also with encoding-based injections). And for the real
WTF, my predecessor converted everything to EUC before inserting,
eventually losing all the emojis and creating all sorts of strange
phenomena, like tables with one column in EUC until a certain date and
SJIS from then on, while all other columns are in EUC.

Is there a way to properly deal with SJIS+emoji extensions (a patch
I'm not aware of, for example)? Is it considered a TODO for future
releases, or should I consider augmenting Postgres myself (if the
latter, could you provide any pointers on how to proceed)?

Thank you,
Zaki

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane
Sent: Monday, March 26, 2007 11:20 AM
To: ITAGAKI Takahiro
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Server-side support of all encodings

ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> PostgreSQL supports SJIS, BIG5, GBK, UHC and GB18030 as client encodings,
> but we cannot use them as server encodings. Is there any reason for it?

Very much so --- they aren't safe ASCII-supersets, and thus for example
the parser will fail on them.  Backend encodings must have the property
that all bytes of a multibyte character are >= 128.
        regards, tom lane
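
The property stated above (every byte of a multibyte character is >= 128)
can be checked empirically. The following is a rough sketch that again
relies on Python's stdlib codecs rather than PostgreSQL's own tables;
count_unsafe is a hypothetical helper name, and "cp949" is Python's codec
name for UHC.

```python
# Sketch: count code points whose multibyte encoding contains a byte < 0x80.
# Python's codecs stand in for PostgreSQL's conversion maps here (an assumption).
ENCODINGS = ["utf-8", "euc_jp", "shift_jis", "big5", "gbk", "cp949", "gb18030"]

def count_unsafe(encoding, start=0x80, stop=0x10000):
    """Count code points in [start, stop) whose multibyte form has a byte < 128."""
    unsafe = 0
    for cp in range(start, stop):
        try:
            raw = chr(cp).encode(encoding)
        except UnicodeEncodeError:
            continue  # not representable in this encoding (or a surrogate)
        if len(raw) > 1 and min(raw) < 0x80:
            unsafe += 1
    return unsafe

for name in ENCODINGS:
    n = count_unsafe(name)
    verdict = "ok" if n == 0 else str(n) + " unsafe characters"
    print(name.ljust(10), verdict)
```

Under these assumptions, UTF-8 and EUC-JP report no violations, while
the client-only encodings listed in the question each report some.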
