Re: Pre-proposal: unicode normalized text - Mailing list pgsql-hackers

From: Nico Williams
Subject: Re: Pre-proposal: unicode normalized text
Msg-id: ZRx2VcsWomTBcE+L@ubby21
In response to: Re: Pre-proposal: unicode normalized text (Jeff Davis <pgsql@j-davis.com>)
Responses: Re: Pre-proposal: unicode normalized text
List: pgsql-hackers
On Tue, Oct 03, 2023 at 12:15:10PM -0700, Jeff Davis wrote:
> On Mon, 2023-10-02 at 15:27 -0500, Nico Williams wrote:
> > I think you misunderstand Unicode normalization and equivalence.
> > There is no standard Unicode `normalize()` that would cause the
> > above equality predicate to be true.  If you normalize to NFD
> > (normal form decomposed) then a _prefix_ of those two strings will
> > be equal, but that's clearly not what you're looking for.

Ugh, my client is not displaying 'a' correctly, thus I misunderstood
your post.

> From [1]:

Here's what you wrote in your post:

| [...] But it's really the same
| character with just a different representation, and if you normalize
| them they are equal:
|
|   SELECT normalize('á') = normalize('á'); -- true

but my client is not displaying 'a' correctly!  (It displays like 'a'
but it should display like 'á'.)  Bah.  So I'd (mis)interpreted you as
saying that normalize('a') should equal normalize('á').  Please
disregard that part of my reply.

> > There are two ways to write 'á' in Unicode: one is pre-composed (one
> > codepoint) and the other is decomposed (two codepoints in this
> > specific case), and it would be nice to be able to preserve input
> > form when storing strings but then still be able to index and match
> > them form-insensitively (in the case of 'á' both equivalent
> > representations should be considered equal, and for UNIQUE indexes
> > they should be considered the same).
>
> Sometimes preserving input differences is a good thing, other times
> it's not, depending on the context.  Almost any data type has some
> aspects of the input that might not be preserved -- leading zeros in a
> number, or whitespace in jsonb, etc.

Almost every Latin input mode out there produces precomposed characters,
so they effectively produce NFC.  I'm not sure whether the same is true
for, e.g., Hangul (Korean) and various other scripts.  But there are
things out there that produce NFD.
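The precomposed/decomposed distinction discussed above can be
demonstrated outside the database; here is a minimal Python sketch
using the standard `unicodedata` module (Python stands in for the SQL
`normalize()` call purely for illustration):

```python
import unicodedata

# Two ways to write 'á': precomposed (one codepoint) and
# decomposed (base letter plus combining accent).
precomposed = "\u00e1"    # LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"    # 'a' + COMBINING ACUTE ACCENT

# They render identically but compare unequal codepoint-for-codepoint.
assert precomposed != decomposed

# After normalization (NFC or NFD, as long as both sides use the same
# form) they compare equal, mirroring:
#   SELECT normalize('á') = normalize('á'); -- true
assert unicodedata.normalize("NFC", precomposed) == \
       unicodedata.normalize("NFC", decomposed)
assert unicodedata.normalize("NFD", precomposed) == \
       unicodedata.normalize("NFD", decomposed)

# NFC yields the one-codepoint form; NFD yields the two-codepoint form.
assert len(unicodedata.normalize("NFC", decomposed)) == 1
assert len(unicodedata.normalize("NFD", precomposed)) == 2
```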
Famously, Apple's HFS+ uses NFD (or something very close to NFD).  So
if you cut-and-paste things that got normalized to NFD into contexts
where normalization isn't done, then you might start wanting to alter
those contexts to either normalize or be form-preserving/
form-insensitive.  Sometimes you don't get to normalize, so you have
to pick form-preserving/form-insensitive behavior.

> If text is stored as normalized with NFC, it could be frustrating if
> the retrieved string has a different binary representation than the
> source data.  But it could also be frustrating to look at two strings
> made up of ordinary characters that look identical and for the
> database to consider them unequal.

Exactly.  If you have such a case you might like the option to make
your database form-preserving and form-insensitive.  That means that
indices need to normalize strings, but tables need to store
unnormalized strings.

ZFS (filesystems are a bit like databases) does just that!

Nico
--
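The form-preserving/form-insensitive behavior described above
(normalize in the index, keep the original bytes in the table) can be
sketched with a toy Python mapping; `FormInsensitiveTable` is a
hypothetical name for illustration, not an API of PostgreSQL or ZFS:

```python
import unicodedata

class FormInsensitiveTable:
    """Toy key-value store: lookups compare keys form-insensitively
    (via an NFC-normalized index), while the originally supplied key
    form is preserved, as a form-preserving database would do."""

    def __init__(self):
        # "index": NFC-normalized key -> "row": (original key, value)
        self._rows = {}

    def put(self, key, value):
        self._rows[unicodedata.normalize("NFC", key)] = (key, value)

    def get(self, key):
        return self._rows[unicodedata.normalize("NFC", key)][1]

    def original_key(self, key):
        return self._rows[unicodedata.normalize("NFC", key)][0]

t = FormInsensitiveTable()
t.put("a\u0301", 42)                 # store under decomposed 'á'
assert t.get("\u00e1") == 42         # lookup with precomposed 'á' works
assert t.original_key("\u00e1") == "a\u0301"   # input form preserved
```

Because both forms normalize to the same NFC key, the decomposed and
precomposed spellings address the same row, yet the stored key keeps
the exact form the writer supplied.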