Character sets (Re: Re: Big 7.1 open items) - Mailing list pgsql-hackers
From | Peter Eisentraut |
---|---|
Subject | Character sets (Re: Re: Big 7.1 open items) |
Date | |
Msg-id | Pine.LNX.4.21.0006200102490.353-100000@localhost.localdomain |
In response to | Re: Re: Big 7.1 open items (Thomas Lockhart <lockhart@alumni.caltech.edu>) |
Responses | Re: Character sets (Re: Re: Big 7.1 open items) |
List | pgsql-hackers |
Thomas Lockhart writes:

> One issue: I can see (or imagine ;) how we can use the Postgres type
> system to manage multiple character sets.

But how are you going to tell a genuine "type" from a character set? And you might have to have three types for each charset. There'd be a lot of redundancy and confusion regarding the input and output functions and other pg_type attributes.

No doubt there's something to be learned from the type system, but character sets have different properties -- like characters(!), collation rules, encoding "translations", and what not. There is no doubt also a need for different error handling. So I think that just dumping every character set into pg_type is not a good idea. That's almost equivalent to having separate types for char(6), char(7), etc.

Instead, I'd suggest that character sets become separate objects. A character entity would carry around its character set in its header somehow. Consider a string concatenation function being invoked with two arguments of the same exotic character set. Using the type system only, you'd have to either provide a function signature for all combinations of character sets, or cast both arguments up to SQL_TEXT, concatenate them, and cast the result back to the original charset. A smarter concatenation function might instead notice that both arguments are of the same character set and simply paste them together right there.

> But allowing arbitrary character sets in, say, table names forces us
> to cope with allowing a mix of character sets in a single column of a
> system table.

The priority is probably the data people store, not the way they get to name their tables.

> Would it be acceptable to have a "default database character set"
> which is allowed to creep into the pg_xxx tables?

I think we could go with making all system table char columns Unicode, but of course they are really of the "name" type, which is another issue completely.
> We should itemize all of these issues so we can keep track of what is
> necessary, possible, and/or "easy".

Here are a couple of "items" I keep wondering about:

* To what extent would we be able to use the operating system's locale facilities? Besides the fact that some systems are deficient or broken one way or another, POSIX really doesn't provide much besides "given two strings, which one is greater", and then only on a per-process basis. We'd really need more than that; see also the LIKE indexing issues, and indexing in general.

* Client support: A lot of language environments provide pretty smooth Unicode support these days, e.g., Java, Perl 5.6, and I think C99 has also made some strides. So while "we can store stuff in any character set you want" is great, it's really no good if it doesn't work transparently with the client interfaces. At least something to keep in mind.

-- 
Peter Eisentraut            Sernanders väg 10:115
peter_e@gmx.net             75262 Uppsala
http://yi.org/peter-e/      Sweden