Re: Pre-proposal: unicode normalized text - Mailing list pgsql-hackers

From Nico Williams
Subject Re: Pre-proposal: unicode normalized text
Date
Msg-id ZUQk99GFe787WC/z@ubby21
Whole thread Raw
In response to Re: Pre-proposal: unicode normalized text  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Fri, Oct 06, 2023 at 02:37:06PM -0400, Robert Haas wrote:
> > Sure, because TEXT in PG doesn't have codeset+encoding as part of it --
> > it's whatever the database's encoding is.  Collation can and should be a
> > porperty of a column, since for Unicode it wouldn't be reasonable to
> > make that part of the type.  But codeset+encoding should really be a
> > property of the type if PG were to support more than one.  IMO.
> 
> No, what I mean is, you can't just be like "oh, the varlena will be
> different in memory than on disk" as if that were no big deal.

It would have to be the same in memory as on disk, indeed, but you might
need new types in C as well for that.

> I agree that, as an alternative to encoding being a column property,
> it could instead be completely a type property, meaning that if you
> want to store, say, LATIN1 text in your UTF-8 database, you first
> create a latint1text data type and then use it, rather than, as in the
> model I proposed, creating a text column and then applying a setting
> like ENCODING latin1 to it. I think that there might be some problems

Yes, that was the idea.

> with that model, but it could also have some benefits. [...]

Mainly, I think, whether you want PG to do automatic codeset conversions
(ugly and problematic) or not, like for when using text functions.

Automatic codeset conversions are problematic because a) it can be lossy
(so what to do when it is?) and b) automatic type conversions can be
surprising.

Ultimately the client would have to do its own codeset conversions, if
it wants them, or treat text in codesets other than its local one as
blobs and leave it for a higher app layer to deal with.

I wouldn't want to propose automatic codeset conversions.  If you'd want
that then you might as well declare it has to all be UTF-8 and say no to
any other codesets.

> But, even if we were all convinced that this kind of feature was good
> to add, I think it would almost certainly be wrong to invent new
> varlena features along the way.

Yes.

Nico
-- 



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: Remove distprep
Next
From: Nico Williams
Date:
Subject: Re: Pre-proposal: unicode normalized text