Re: Pre-proposal: unicode normalized text - Mailing list pgsql-hackers
From | Jeff Davis |
---|---|
Subject | Re: Pre-proposal: unicode normalized text |
Date | |
Msg-id | b870285789a03a7e6ef298ba3adaf9436b829c2e.camel@j-davis.com |
In response to | Re: Pre-proposal: unicode normalized text (Isaac Morland <isaac.morland@gmail.com>) |
List | pgsql-hackers |
On Thu, 2023-10-05 at 09:10 -0400, Isaac Morland wrote:
> In the case you describe, the users don’t have text at all; they have
> bytes, and a vague belief about what encoding the bytes might be in
> and therefore what characters they are intended to represent. The
> correct way to store that in the database is using bytea.

I wouldn't be so absolute. It's text data to the user, and is presumably working fine for them now, and if they switched to bytea today then 'foo' would show up as '\x666f6f' in psql.

The point is that this is a somewhat messy problem, because there's so much software out there that treats byte strings and textual data interchangeably. Rust goes the extra mile to organize all of this, and it ends up with:

* String -- always UTF-8, never NUL-terminated
* CString -- NUL-terminated byte sequence with no internal NULs
* OsString[3] -- needed to make a Path[4], which is needed to open a file[5]
* Vec<u8> -- any byte sequence

and I suppose we could work towards offering better support for these different types, the casts between them, and delivering them in a form the client can understand. But I wouldn't describe it as a solved problem with one "correct" solution. (A short sketch of these types and the conversions between them follows at the end of this message.)

One takeaway from this discussion is that it would be useful to provide more flexibility in how values are represented to the client in a more general way. In addition to encoding, representational issues have come up with binary formats, bytea, extra_float_digits, etc.

The collection of books by CJ Date & Hugh Darwen, et al. (sorry, I don't remember exactly which books) made the theoretical case for explicitly distinguishing values from representations at the language level. We're starting to see that representational issues can't be satisfied with a few special cases and hacks -- it's worth thinking about a general solution to that problem. There was also a lot of relevant discussion about how to think about overlapping domains (e.g. ASCII is valid in any of these text domains).

> Text types should be for when you know what characters you want to
> store. In this scenario, the implementation detail of what encoding
> the database uses internally to write the data on the disk doesn't
> matter, any more than it matters to a casual user how a table is
> stored on disk.

Perhaps the user and application do know, and there's some kind of subtlety that we're missing, or some historical artefact that we're not accounting for, and that somehow makes UTF-8 unsuitable. Surely there are applications that treat certain byte sequences in non-standard ways, and perhaps not all of those byte sequences can be reproduced by transcoding from UTF-8 to the client_encoding. In any case, I would want to understand in detail why a user thinks UTF-8 is not good enough before I make too strong a statement here.

Even the terminal font that I use renders some "identical" Unicode characters slightly differently depending on the code points from which they are composed. I believe that's an intentional convenience, to make it more apparent why the "diff" command (or another byte-based tool) is showing a difference between two textually identical strings, but it's also a violation of Unicode. (This is another reason why normalization might not be for everyone, but I believe it's still good in typical cases.)

Regards,
        Jeff Davis
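[Editor's note: a minimal Rust sketch of the string types listed above and the checked conversions between them, plus the composed vs. decomposed forms of an "identical" character that normalization is meant to reconcile. The specific values ("foo", "foo.txt", U+00E9 vs U+0065 U+0301) are illustrative only.]

    use std::ffi::{CString, OsString};
    use std::path::Path;

    fn main() {
        // String: always valid UTF-8, never NUL-terminated.
        let s = String::from("foo");

        // Vec<u8>: an arbitrary byte sequence. Turning it into a String
        // is a checked conversion, because the bytes may not be valid UTF-8.
        let bytes: Vec<u8> = vec![0x66, 0x6f, 0x6f]; // "foo"
        let from_bytes = String::from_utf8(bytes).expect("not valid UTF-8");

        // CString: NUL-terminated with no interior NULs; also a checked
        // conversion, since a String may contain NUL bytes.
        let c = CString::new(from_bytes.as_str()).expect("interior NUL byte");

        // OsString -> Path: the form needed to actually open a file.
        let os = OsString::from("foo.txt");
        let path: &Path = Path::new(&os);

        // Two "identical" strings that differ at the code-point level:
        // U+00E9 (precomposed) vs. U+0065 U+0301 (e + combining acute).
        let composed = "\u{e9}";
        let decomposed = "e\u{301}";
        assert_ne!(composed, decomposed); // same to a reader, unequal as bytes

        println!("{} {:?} {:?} {} {}", s, c, path, composed, decomposed);
    }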