Character sets (Re: Re: Big 7.1 open items) - Mailing list pgsql-hackers

From Peter Eisentraut
Subject Character sets (Re: Re: Big 7.1 open items)
Date
Msg-id Pine.LNX.4.21.0006200102490.353-100000@localhost.localdomain
In response to Re: Re: Big 7.1 open items  (Thomas Lockhart <lockhart@alumni.caltech.edu>)
Responses Re: Character sets (Re: Re: Big 7.1 open items)
List pgsql-hackers
Thomas Lockhart writes:

> One issue: I can see (or imagine ;) how we can use the Postgres type
> system to manage multiple character sets.

But how are you going to tell a genuine "type" from a character set? And
you might have to have three types for each charset. There'd be a lot of
redundancy and confusion regarding the input and output functions and
other pg_type attributes. No doubt there's something to be learned from
the type system, but character sets have different properties -- like
characters(!), collation rules, encoding "translations" and what not.
There is no doubt also a need for different error handling. So I think that
just dumping every character set into pg_type is not a good idea. That's
almost equivalent to having separate types for char(6), char(7), etc.

Instead, I'd suggest that character sets become separate objects. A
character entity would carry around its character set in its header
somehow. Consider a string concatenation function, being invoked with two
arguments of the same exotic character set. Using only the type system,
you'd either have to provide a function signature for every combination of
character sets, or you'd have to cast them up to SQL_TEXT, concatenate
them, and cast them back to the original charset. A smarter concatenation
function instead might notice that both arguments are of the same
character set and simply paste them together right there.
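
To make the idea concrete, here is a minimal sketch, not a proposal for
the actual backend representation. CharString and concat are hypothetical
names, a Python codec name stands in for a character set object, and a
Python str plays the role of SQL_TEXT as the common representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharString:
    """Hypothetical character value that carries its character set in a header."""
    charset: str  # Python codec name, standing in for a character set object
    data: bytes   # the encoded bytes


def concat(a: CharString, b: CharString) -> CharString:
    # Fast path: both arguments use the same character set, so the
    # encoded bytes can be pasted together without any conversion.
    if a.charset == b.charset:
        return CharString(a.charset, a.data + b.data)
    # Slow path: convert both to a common representation (a Python str,
    # standing in for SQL_TEXT), concatenate, and re-encode. Returning
    # UTF-8 here is an assumption of this sketch.
    text = a.data.decode(a.charset) + b.data.decode(b.charset)
    return CharString("utf-8", text.encode("utf-8"))
```

The fast path assumes a stateless encoding without a byte order mark;
an encoding with shift states would need more care than plain byte
concatenation.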


> But allowing arbitrary character sets in, say, table names forces us
> to cope with allowing a mix of character sets in a single column of a
> system table.

The priority is probably the data people store, not the way they get to
name their tables.

> Would it be acceptable to have a "default database character set"
> which is allowed to creep into the pg_xxx tables?

I think we could go with making all system table char columns Unicode, but
of course they are really of the "name" type, which is another issue
completely.


> We should itemize all of these issues so we can keep track of what is
> necessary, possible, and/or "easy".

Here are a couple of "items" I keep wondering about:

* To what extent would we be able to use the operating system's locale
facilities? Besides the fact that some systems are deficient or broken one
way or another, POSIX really doesn't provide much besides "given two
strings, which one is greater", and then only on a per-process basis.
We'd really need more than that; see also the LIKE indexing issues, and
indexing in general.

* Client support: A lot of language environments provide pretty smooth
Unicode support these days, e.g., Java, Perl 5.6, and I think that C99 has
also made some strides. So while "we can store stuff in any character set
you want" is great, it's really no good if it doesn't work transparently
with the client interfaces. At least something to keep in mind.
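
As an illustration of the per-process limitation in the first item, here
is a small sketch using Python's locale module, which wraps the C
library's setlocale() and strcoll(). compare_in_locale is a hypothetical
helper, and which locale names are available depends on the system; only
"C" can be assumed to exist.

```python
import locale


def compare_in_locale(a: str, b: str, name: str) -> int:
    """Compare two strings under the named collation locale.

    The collation locale is state shared by the whole process, so it has
    to be switched before the comparison and restored afterwards. This
    is not safe if other threads compare strings at the same time.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
        # strcoll answers only "which one is greater": negative, zero,
        # or positive. It offers nothing for prefix or pattern matching.
        return locale.strcoll(a, b)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)
```

Comparing values from two columns with different collations would mean
switching the process-wide setting back and forth like this for every
comparison.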


-- 
Peter Eisentraut                  Sernanders väg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden


