Thread: Pre-proposal: unicode normalized text
One of the frustrations with using the "C" locale (or any deterministic
locale) is that the following returns false:

  SELECT 'á' = 'á'; -- false

because those are the unicode sequences U&'\0061\0301' and U&'\00E1',
respectively, so memcmp() returns non-zero. But it's really the same
character with just a different representation, and if you normalize them
they are equal:

  SELECT normalize('á') = normalize('á'); -- true

The idea is to have a new data type, say "UTEXT", that normalizes the input
so that it can have an improved notion of equality while still using
memcmp().

Unicode guarantees that "the results of normalizing a string on one version
will always be the same as normalizing it on any other version, as long as
the string contains only assigned characters according to both versions"[1].
It also guarantees that it "will not reallocate, remove, or reassign"
characters[2]. That means that we can normalize in a forward-compatible way
as long as we don't allow the use of unassigned code points.

I looked at the standard to see what it had to say, and it discusses
normalization, but a standard UCS string with an unassigned code point is
not an error. Without a data type to enforce the constraint that there are
no unassigned code points, we can't guarantee forward compatibility. Some
other systems support NVARCHAR, but I didn't see any guarantee of
normalization or blocking unassigned code points there, either.

UTEXT benefits:
* slightly better natural language semantics than TEXT with deterministic
  collation
* still deterministic=true
* fast memcmp()-based comparisons
* no breaking semantic changes as unicode evolves

TEXT allows unassigned code points, and generally returns the same byte
sequences that were originally entered; therefore UTEXT is not a
replacement for TEXT.

UTEXT could be built-in or it could be an extension or in contrib. If an
extension, we'd probably want to at least expose a function that can detect
unassigned code points, so that it's easy to be consistent with the
auto-generated unicode tables. I also notice that there already is an
unassigned code points table in saslprep.c, but it seems to be frozen as of
Unicode 3.2, and I'm not sure why.

Questions:

* Would this be useful enough to justify a new data type? Would it be
  confusing about when to choose one versus the other?
* Would cross-type comparisons between TEXT and UTEXT become a major
  problem that would reduce the utility?
* Should "some_utext_value = some_text_value" coerce the LHS to TEXT or the
  RHS to UTEXT?
* Other comments or am I missing something?

Regards,
	Jeff Davis

[1] https://unicode.org/reports/tr15/
[2] https://www.unicode.org/policies/stability_policy.html
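A minimal sketch of how the proposed type might be used, assuming it
existed; "utext" and its behavior here are hypothetical, nothing below is
implemented:

  -- Hypothetical usage of the proposed type (not implemented).
  CREATE TABLE t (s utext);

  -- Input would be normalized (e.g. to NFC) on the way in.
  INSERT INTO t VALUES (U&'\0061\0301');

  -- Equality would then still be a plain byte-wise comparison, yet true here.
  SELECT s = U&'\00E1' FROM t;   -- true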
On 13.09.23 00:47, Jeff Davis wrote:
> The idea is to have a new data type, say "UTEXT", that normalizes the
> input so that it can have an improved notion of equality while still
> using memcmp().

I think a new type like this would obviously be suboptimal because it's
nonstandard and most people wouldn't use it.

I think a better direction here would be to work toward making
nondeterministic collations usable on the global/database level and then
encouraging users to use those.

It's also not clear which way the performance tradeoffs would fall.

Nondeterministic collations are obviously going to be slower, but by how
much?  People have accepted moving from C locale to "real" locales because
they needed those semantics.  Would it be any worse moving from real
locales to "even realer" locales?

On the other hand, a utext type would either require a large set of its own
functions and operators, or you would have to inject text-to-utext casts in
places, which would also introduce overhead.
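For comparison, the nondeterministic-collation approach can already be
tried at the column or expression level with existing syntax; a minimal
example, assuming an ICU build and a UTF-8 database (the collation name is
arbitrary):

  -- Canonically equivalent strings compare equal under a
  -- nondeterministic ICU collation.
  CREATE COLLATION und_nd (provider = icu, locale = 'und', deterministic = false);

  SELECT U&'\0061\0301' = U&'\00E1';                  -- false (byte-wise)
  SELECT U&'\0061\0301' = U&'\00E1' COLLATE und_nd;   -- true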
On Mon, Oct 2, 2023 at 3:42 PM Peter Eisentraut <peter@eisentraut.org> wrote:
> I think a better direction here would be to work toward making
> nondeterministic collations usable on the global/database level and then
> encouraging users to use those.

It seems to me that this overlooks one of the major points of Jeff's
proposal, which is that we don't reject text input that contains unassigned
code points. That decision turns out to be really painful.

Here, Jeff mentions normalization, but I think it's a major issue with
collation support. If new code points are added, users can put them into
the database before they are known to the collation library, and then when
they become known to the collation library the sort order changes and
indexes break.

Would we endorse a proposal to make pg_catalog.text with encoding UTF-8
reject code points that aren't yet known to the collation library? To do so
would be to tighten things up considerably from where they stand today, and
the way things stand today is already rigid enough to cause problems for
some users. But if we're not willing to do that then I find it easy to
understand why Jeff wants an alternative type that does.

Now, there is still the question of whether such a data type would properly
belong in core or even contrib rather than being an out-of-core project.
It's not obvious to me that such a data type would get enough traction that
we'd want it to be part of PostgreSQL itself. But at the same time I can
certainly understand why Jeff finds the status quo problematic.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Sep 12, 2023 at 03:47:10PM -0700, Jeff Davis wrote:
> One of the frustrations with using the "C" locale (or any deterministic
> locale) is that the following returns false:
>
>   SELECT 'á' = 'á'; -- false
>
> because those are the unicode sequences U&'\0061\0301' and U&'\00E1',
> respectively, so memcmp() returns non-zero. But it's really the same
> character with just a different representation, and if you normalize
> them they are equal:
>
>   SELECT normalize('á') = normalize('á'); -- true

I think you misunderstand Unicode normalization and equivalence. There is
no standard Unicode `normalize()` that would cause the above equality
predicate to be true. If you normalize to NFD (normal form decomposed) then
a _prefix_ of those two strings will be equal, but that's clearly not what
you're looking for.

PostgreSQL already has Unicode normalization support, though it would be
nice to also have form-insensitive indexing and equality predicates.

There are two ways to write 'á' in Unicode: one is pre-composed (one
codepoint) and the other is decomposed (two codepoints in this specific
case), and it would be nice to be able to preserve input form when storing
strings but then still be able to index and match them form-insensitively
(in the case of 'á' both equivalent representations should be considered
equal, and for UNIQUE indexes they should be considered the same).

You could also have functions that perform lossy normalization in the sort
of way that soundex does, such as first normalizing to NFD then dropping
all combining codepoints, which then could allow 'á' to be eq to 'a'. But
this would not be a Unicode normalization function.

Nico
--
On Mon, 2023-10-02 at 15:27 -0500, Nico Williams wrote:
> I think you misunderstand Unicode normalization and equivalence. There
> is no standard Unicode `normalize()` that would cause the above
> equality predicate to be true. If you normalize to NFD (normal form
> decomposed) then a _prefix_ of those two strings will be equal, but
> that's clearly not what you're looking for.

From [1]:

"Unicode Normalization Forms are formally defined normalizations of Unicode
strings which make it possible to determine whether any two Unicode strings
are equivalent to each other. Depending on the particular Unicode
Normalization Form, that equivalence can either be a canonical equivalence
or a compatibility equivalence... A binary comparison of the transformed
strings will then determine equivalence."

NFC and NFD are based on Canonical Equivalence.

"Canonical equivalence is a fundamental equivalency between characters or
sequences of characters which represent the same abstract character, and
which when correctly displayed should always have the same visual
appearance and behavior."

Can you explain why NFC (the default form of normalization used by the
postgres normalize() function), followed by memcmp(), is not the right
thing to use to determine Canonical Equivalence?

Or are you saying that Canonical Equivalence is not a useful thing to test?

What do you mean about the "prefix"?

In Postgres today:

  SELECT normalize(U&'\0061\0301', nfc)::bytea; -- \xc3a1
  SELECT normalize(U&'\00E1', nfc)::bytea;      -- \xc3a1
  SELECT normalize(U&'\0061\0301', nfd)::bytea; -- \x61cc81
  SELECT normalize(U&'\00E1', nfd)::bytea;      -- \x61cc81

which looks useful to me, but I assume you are saying that it doesn't
generalize well to other cases?

[1] https://unicode.org/reports/tr15/

> There are two ways to write 'á' in Unicode: one is pre-composed (one
> codepoint) and the other is decomposed (two codepoints in this specific
> case), and it would be nice to be able to preserve input form when
> storing strings but then still be able to index and match them
> form-insensitively (in the case of 'á' both equivalent representations
> should be considered equal, and for UNIQUE indexes they should be
> considered the same).

Sometimes preserving input differences is a good thing, other times it's
not, depending on the context. Almost any data type has some aspects of the
input that might not be preserved -- leading zeros in a number, or
whitespace in jsonb, etc.

If text is stored as normalized with NFC, it could be frustrating if the
retrieved string has a different binary representation than the source
data. But it could also be frustrating to look at two strings made up of
ordinary characters that look identical and for the database to consider
them unequal.

Regards,
	Jeff Davis
On Mon, 2023-10-02 at 16:06 -0400, Robert Haas wrote:
> It seems to me that this overlooks one of the major points of Jeff's
> proposal, which is that we don't reject text input that contains
> unassigned code points. That decision turns out to be really painful.

Yeah, because we lose forward-compatibility of some useful operations.

> Here, Jeff mentions normalization, but I think it's a major issue with
> collation support. If new code points are added, users can put them
> into the database before they are known to the collation library, and
> then when they become known to the collation library the sort order
> changes and indexes break.

The collation version number may reflect the change in understanding about
assigned code points that may affect collation -- though I'd like to
understand whether this is guaranteed or not.

Regardless, given that (a) we don't have a good story for migrating to new
collation versions; and (b) it would be painful to rebuild indexes even if
we did; then you are right that it's a problem.

> Would we endorse a proposal to make pg_catalog.text with encoding UTF-8
> reject code points that aren't yet known to the collation library? To
> do so would be to tighten things up considerably from where they stand
> today, and the way things stand today is already rigid enough to cause
> problems for some users.

What problems exist today due to the rigidity of text? I assume you mean
because we reject invalid byte sequences? Yeah, I'm sure that causes a
problem for some (especially migrations), but it's difficult for me to
imagine a database working well with no rules at all for the basic data
types.

> Now, there is still the question of whether such a data type would
> properly belong in core or even contrib rather than being an
> out-of-core project. It's not obvious to me that such a data type would
> get enough traction that we'd want it to be part of PostgreSQL itself.

At minimum I think we need to have some internal functions to check for
unassigned code points. That belongs in core, because we generate the
unicode tables from a specific version.

I also think we should expose some SQL functions to check for unassigned
code points. That sounds useful, especially since we already expose
normalization functions.

One could easily imagine a domain with CHECK(NOT contains_unassigned(a)).
Or an extension with a data type that uses the internal functions.

Whether we ever get to a core data type -- and more importantly, whether
anyone uses it -- I'm not sure.

> But at the same time I can certainly understand why Jeff finds
> the status quo problematic.

Yeah, I am looking for a better compromise between:

* everything is memcmp() and 'á' sometimes doesn't equal 'á' (depending on
  code point sequence)
* everything is constantly changing, indexes break, and text comparisons
  are slow

A stable idea of unicode normalization based on using only assigned code
points is very tempting.

Regards,
	Jeff Davis
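Roughly what such a domain could look like; contains_unassigned() is the
hypothetical function discussed above, not something that exists today:

  -- contains_unassigned() is hypothetical: true if the string contains any
  -- code point unassigned in the Unicode version Postgres was built with.
  CREATE DOMAIN stable_text AS text
      CHECK (NOT contains_unassigned(VALUE));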
On Tue, Oct 03, 2023 at 12:15:10PM -0700, Jeff Davis wrote:
> On Mon, 2023-10-02 at 15:27 -0500, Nico Williams wrote:
> > I think you misunderstand Unicode normalization and equivalence.
> > There is no standard Unicode `normalize()` that would cause the
> > above equality predicate to be true. If you normalize to NFD
> > (normal form decomposed) then a _prefix_ of those two strings will
> > be equal, but that's clearly not what you're looking for.

Ugh, my client is not displaying 'a' correctly, thus I misunderstood your
post.

> From [1]:

Here's what you wrote in your post:

| [...] But it's really the same
| character with just a different representation, and if you normalize
| them they are equal:
|
|   SELECT normalize('á') = normalize('á'); -- true

but my client is not displaying 'a' correctly! (It displays like 'a' but it
should display like 'á'.) Bah. So I'd (mis)interpreted you as saying that
normalize('a') should equal normalize('á'). Please disregard that part of
my reply.

> > There are two ways to write 'á' in Unicode: one is pre-composed (one
> > codepoint) and the other is decomposed (two codepoints in this
> > specific case), and it would be nice to be able to preserve input
> > form when storing strings but then still be able to index and match
> > them form-insensitively (in the case of 'á' both equivalent
> > representations should be considered equal, and for UNIQUE indexes
> > they should be considered the same).
>
> Sometimes preserving input differences is a good thing, other times
> it's not, depending on the context. Almost any data type has some
> aspects of the input that might not be preserved -- leading zeros in a
> number, or whitespace in jsonb, etc.

Almost every Latin input mode out there produces precomposed characters and
so they effectively produce NFC. I'm not sure if the same is true for,
e.g., Hangul (Korean) and various other scripts.

But there are things out there that produce NFD. Famously Apple's HFS+ uses
NFD (or something very close to NFD). So if you cut-n-paste things that got
normalized to NFD and paste them into contexts where normalization isn't
done, then you might start wanting to alter those contexts to either
normalize or be form-preserving/form-insensitive. Sometimes you don't get
to normalize, so you have to pick form-preserving/form-insensitive behavior.

> If text is stored as normalized with NFC, it could be frustrating if
> the retrieved string has a different binary representation than the
> source data. But it could also be frustrating to look at two strings
> made up of ordinary characters that look identical and for the database
> to consider them unequal.

Exactly. If you have such a case you might like the option to make your
database form-preserving and form-insensitive. That means that indices need
to normalize strings, but tables need to store unnormalized strings.

ZFS (filesystems are a bit like databases) does just that!

Nico
--
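One way to approximate this form-preserving/form-insensitive behavior with
existing features is an expression index over normalize(); a rough sketch,
assuming a UTF-8 database (table and index names are illustrative):

  -- Stored values keep whatever form the client sent; the index and the
  -- lookups go through normalize(), so equivalent forms collide as intended.
  CREATE TABLE names (name text);
  CREATE UNIQUE INDEX names_nfc_idx ON names ((normalize(name, nfc)));

  -- Form-insensitive lookup that can use the index above.
  SELECT * FROM names
  WHERE normalize(name, nfc) = normalize(U&'\0061\0301', nfc);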
On Tue, 2023-10-03 at 15:15 -0500, Nico Williams wrote:
> Ugh, my client is not displaying 'a' correctly

Ugh. Is that an argument in favor of normalization or against?

I've also noticed that some fonts render the same character a bit
differently depending on the constituent code points. For instance, if the
accent is its own code point, it seems to be more prominent than if a
single code point represents both the base character and the accent. That
seems to be a violation, but I can understand why that might be useful.

> Almost every Latin input mode out there produces precomposed characters
> and so they effectively produce NFC.

The problem is not the normal case, the problem will be things like obscure
input methods, some kind of software that's being too clever, or some kind
of malicious user trying to confuse the database.

> That means that indices need to normalize strings, but tables need to
> store unnormalized strings.

That's an interesting idea. Would the equality operator normalize first, or
are you saying that the index would need to recheck the results?

Regards,
	Jeff Davis
On Mon, 2023-10-02 at 10:47 +0200, Peter Eisentraut wrote:
> I think a better direction here would be to work toward making
> nondeterministic collations usable on the global/database level and
> then encouraging users to use those.
>
> It's also not clear which way the performance tradeoffs would fall.
>
> Nondeterministic collations are obviously going to be slower, but by
> how much? People have accepted moving from C locale to "real" locales
> because they needed those semantics. Would it be any worse moving from
> real locales to "even realer" locales?

If you normalize first, then you can get some semantic improvements without
giving up on the stability and performance of memcmp(). That seems like a
win with zero costs in terms of stability or performance (except perhaps
some extra text->utext casts).

Going to a "real" locale gives more semantic benefits but at a very high
cost: depending on a collation provider library, dealing with collation
changes, and performance costs.

While supporting the use of nondeterministic collations at the database
level may be a good idea, it's not helping to reach the compromise that I'm
trying to reach in this thread.

Regards,
	Jeff Davis
On Tue, Oct 03, 2023 at 03:34:44PM -0700, Jeff Davis wrote:
> On Tue, 2023-10-03 at 15:15 -0500, Nico Williams wrote:
> > Ugh, my client is not displaying 'a' correctly
>
> Ugh. Is that an argument in favor of normalization or against?

Heheh, well, it's an argument in favor of more software getting this right
(darn it). It's also an argument for building a time machine so HFS+ can
just always have used NFC. But the existence of UTF-16 is proof that time
machines don't exist (or that only bad actors have them).

> I've also noticed that some fonts render the same character a bit
> differently depending on the constituent code points. For instance, if
> the accent is its own code point, it seems to be more prominent than if
> a single code point represents both the base character and the accent.
> That seems to be a violation, but I can understand why that might be
> useful.

Yes, that happens. Did you know that the ASCII character set was designed
with overstrike in mind for typing of accented Latin characters? Unicode
combining sequences are kinda like that, but more complex.

Yes, the idea really was that you could write a<BS>' (or '<BS>a) to get á.
That's how people did it with typewriters anyways.

> > Almost every Latin input mode out there produces precomposed
> > characters and so they effectively produce NFC.
>
> The problem is not the normal case, the problem will be things like
> obscure input methods, some kind of software that's being too clever,
> or some kind of malicious user trying to confuse the database.

_HFS+ enters the chat_

> > That means that indices need to normalize strings, but tables need to
> > store unnormalized strings.
>
> That's an interesting idea. Would the equality operator normalize
> first, or are you saying that the index would need to recheck the
> results?

You can optimize this to avoid having to normalize first. Most strings are
not equal, and they tend to differ early. And most strings will likely be
ASCII-mostly or in the same form anyways. So you can just walk a cursor
down each string looking at two bytes, and if they are both ASCII then you
move each cursor forward by one byte, and if they are not both ASCII then
you take a slow path where you normalize one grapheme cluster at each
cursor (if necessary) and compare that. (ZFS does this.)

You can also assume ASCII-mostly, load as many bits of each string (padding
as needed) as will fit in SIMD registers, compare and check that they're
all ASCII, and if not then jump to the slow path.

You can also normalize one grapheme cluster at a time when hashing (e.g.,
for hash indices), thus avoiding a large allocation if the string is large.

Nico
--
On Tue, Oct 3, 2023 at 3:54 PM Jeff Davis <pgsql@j-davis.com> wrote:
> I assume you mean because we reject invalid byte sequences? Yeah, I'm
> sure that causes a problem for some (especially migrations), but it's
> difficult for me to imagine a database working well with no rules at
> all for the basic data types.

There's a very popular commercial database where, or so I have been led to
believe, any byte sequence at all is accepted when you try to put values
into the database. The rumors I've heard -- I have not played with it
myself -- are that when you try to do anything, byte sequences that are not
valid in the configured encoding are treated as single-byte characters or
something of that sort. So like if you had UTF-8 as the encoding and the
first byte of the string is something that can only appear as a
continuation byte in UTF-8, I think that byte is just treated as a separate
character. I don't quite know how you make all of the operations work that
way, but it seems like they've come up with a somewhat-consistent set of
principles that are applied across the board. Very different from the PG
philosophy, of course. And I'm not saying it's better. But it does
eliminate the problem of being unable to load data into the database,
because in such a model there's no such thing as invalidly-encoded data.
Instead, an encoding like UTF-8 is effectively extended so that every byte
sequence represents *something*. Whether that something is what you wanted
is another story.

At any rate, if we were to go in the direction of rejecting code points
that aren't yet assigned, or aren't yet known to the collation library,
that's another way for data loading to fail. Which feels like very
defensible behavior, but not what everyone wants, or is used to.

> At minimum I think we need to have some internal functions to check for
> unassigned code points. That belongs in core, because we generate the
> unicode tables from a specific version.

That's a good idea.

> I also think we should expose some SQL functions to check for
> unassigned code points. That sounds useful, especially since we already
> expose normalization functions.

That's a good idea, too.

> One could easily imagine a domain with CHECK(NOT
> contains_unassigned(a)). Or an extension with a data type that uses the
> internal functions.

Yeah.

> Whether we ever get to a core data type -- and more importantly,
> whether anyone uses it -- I'm not sure.

Same here.

> Yeah, I am looking for a better compromise between:
>
> * everything is memcmp() and 'á' sometimes doesn't equal 'á'
>   (depending on code point sequence)
> * everything is constantly changing, indexes break, and text
>   comparisons are slow
>
> A stable idea of unicode normalization based on using only assigned
> code points is very tempting.

The fact that there are multiple types of normalization and multiple
notions of equality doesn't make this easier.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Sep 12, 2023 at 03:47:10PM -0700, Jeff Davis wrote:
> The idea is to have a new data type, say "UTEXT", that normalizes the
> input so that it can have an improved notion of equality while still
> using memcmp().

A UTEXT type would be helpful for specifying that the text must be Unicode
(in which transform?) even if the character data encoding for the database
is not UTF-8.

Maybe UTF8 might be a better name for the new type, since it would denote
the transform (and would allow for UTF16 and UTF32 some day, though it's
doubtful those would ever happen).

But it's one thing to specify Unicode (and transform) in the type and
another to specify an NF to normalize to on insert or on lookup.

How about new column constraint keywords, such as NORMALIZE (meaning
normalize on insert) and NORMALIZED (meaning reject non-canonical form
text), with an optional parenthetical by which to specify a non-default
form? (These would apply to TEXT as well when the default encoding for the
DB is UTF-8.) One could then ALTER TABLE to add this to existing tables.

This would also make it easier to add a form-preserving/form-insensitive
mode later if it turns out to be useful or necessary, maybe making it the
default for Unicode text in new tables.

> Questions:
>
> * Would this be useful enough to justify a new data type? Would it be
>   confusing about when to choose one versus the other?

Yes. See above. I think I'd rather have it be called UTF8, and the
normalization properties of it to be specified as column constraints.

> * Would cross-type comparisons between TEXT and UTEXT become a major
>   problem that would reduce the utility?

Maybe when the database's encoding is UTF_8 then UTEXT (or UTF8) can be an
alias of TEXT.

> * Should "some_utext_value = some_text_value" coerce the LHS to TEXT
>   or the RHS to UTEXT?

Ooh, this is nice! If the TEXT is _not_ UTF-8 then it could be converted to
UTF-8. So I think which is RHS and which is LHS doesn't matter -- it's
which is UTF-8, and if both are then the only thing left to do is
normalize, and for that I'd take the LHS' form if the LHS is UTF-8, else
the RHS'.

Nico
--
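Roughly how the suggested constraint keywords might read in DDL; this is
purely hypothetical syntax, not something PostgreSQL accepts:

  -- Hypothetical column constraints sketched above (not implemented).
  CREATE TABLE docs (
      title text NORMALIZED,        -- reject input not already in the default form
      body  text NORMALIZE (NFD)    -- normalize to NFD on insert/update
  );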
On Wed, Oct 4, 2023 at 1:27 PM Nico Williams <nico@cryptonector.com> wrote:
> A UTEXT type would be helpful for specifying that the text must be
> Unicode (in which transform?) even if the character data encoding for
> the database is not UTF-8.

That's actually pretty thorny ... because right now client_encoding
specifies the encoding to be used for all data sent to the client. So would
we convert the data from UTF8 to the selected client encoding? Or what?

--
Robert Haas
EDB: http://www.enterprisedb.com
On 2023-10-04 13:47, Robert Haas wrote:
> On Wed, Oct 4, 2023 at 1:27 PM Nico Williams <nico@cryptonector.com> wrote:
>> A UTEXT type would be helpful for specifying that the text must be
>> Unicode (in which transform?) even if the character data encoding for
>> the database is not UTF-8.
>
> That's actually pretty thorny ... because right now client_encoding
> specifies the encoding to be used for all data sent to the client. So
> would we convert the data from UTF8 to the selected client encoding?

The SQL standard would have me able to:

  CREATE TABLE foo (
    a CHARACTER VARYING CHARACTER SET UTF8,
    b CHARACTER VARYING CHARACTER SET LATIN1
  )

and so on, and write character literals like

  _UTF8'Hello, world!' and _LATIN1'Hello, world!'

and have those columns and data types independently contain what they can
contain, without constraints imposed by one overall database encoding.

Obviously, we're far from being able to do that. But should it become
desirable to get closer, would it be worthwhile to also try to follow how
the standard would have it look?

Clearly, part of the job would involve making the wire protocol able to
transmit binary values and identify their encodings.

Regards,
-Chap
On Wed, Oct 4, 2023 at 2:02 PM Chapman Flack <chap@anastigmatix.net> wrote:
> Clearly, part of the job would involve making the wire protocol
> able to transmit binary values and identify their encodings.

Right. Which unfortunately is moving the goal posts into the stratosphere
compared to any other work mentioned so far. I agree it would be great. But
not if you want concrete progress any time soon.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, 4 Oct 2023 at 14:05, Chapman Flack <chap@anastigmatix.net> wrote:
> On 2023-10-04 13:47, Robert Haas wrote:
>
> The SQL standard would have me able to:
>
>   CREATE TABLE foo (
>     a CHARACTER VARYING CHARACTER SET UTF8,
>     b CHARACTER VARYING CHARACTER SET LATIN1
>   )
>
> and so on, and write character literals like
>
>   _UTF8'Hello, world!' and _LATIN1'Hello, world!'
>
> and have those columns and data types independently contain what
> they can contain, without constraints imposed by one overall
> database encoding.
>
> Obviously, we're far from being able to do that. But should it
> become desirable to get closer, would it be worthwhile to also
> try to follow how the standard would have it look?
>
> Clearly, part of the job would involve making the wire protocol
> able to transmit binary values and identify their encodings.
I would go in the other direction (note: I’m ignoring all backward compatibility considerations related to the current design of Postgres).
Always store only UTF-8 in the database, and send only UTF-8 on the wire protocol. If we still want to have a concept of "client encoding", have the client libpq take care of translating the bytes between the bytes used by the caller and the bytes sent on the wire.
Note that you could still define columns as you say, but the character set specification would effectively act simply as a CHECK constraint on the characters allowed, essentially CHECK (column_name ~ '^[...all characters in encoding...]*$'). We don't allow different on-disk representations of dates or other data types, except when we really need to, and then we have multiple data types (e.g. int vs. float) rather than different ways of storing the same datatype.
What about characters not in UTF-8? If a character is important enough for us to worry about in Postgres, it’s important enough to get a U+ number from the Unicode Consortium, which automatically puts it in UTF-8. In the modern context, "plain text" means "UTF-8 encoded text", as far as I'm concerned.
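A rough sketch of the CHECK-constraint reading of CHARACTER SET described above, using the existing convert_to() function instead of a character-class regex; the helper name is made up:

  -- is_latin1() is a made-up helper: true if every character of the string
  -- can be represented in LATIN1 (convert_to() raises an error otherwise).
  CREATE FUNCTION is_latin1(t text) RETURNS boolean
  LANGUAGE plpgsql STABLE AS $$
  BEGIN
      PERFORM convert_to(t, 'LATIN1');
      RETURN true;
  EXCEPTION WHEN untranslatable_character THEN
      RETURN false;
  END;
  $$;

  CREATE TABLE foo (
      b text CHECK (is_latin1(b))   -- approximately "CHARACTER VARYING CHARACTER SET LATIN1"
  );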
On Wed, 2023-10-04 at 13:16 -0400, Robert Haas wrote:
> any byte sequence at all is accepted when you try to
> put values into the database.

We support SQL_ASCII, which allows something similar.

> At any rate, if we were to go in the direction of rejecting code points
> that aren't yet assigned, or aren't yet known to the collation library,
> that's another way for data loading to fail.

A failure during data loading is either a feature or a bug, depending on
whether you are the one loading the data or the one trying to make sense of
it later ;-)

> Which feels like
> very defensible behavior, but not what everyone wants, or is used to.

Yeah, there are many reasons someone might want to accept unassigned code
points. An obvious one is if their application is on a newer version of
unicode where the codepoint *is* assigned.

> The fact that there are multiple types of normalization and multiple
> notions of equality doesn't make this easier.

NFC is really the only one that makes sense.

NFD is semantically the same as NFC, but expanded into a larger
representation. NFKC/NFKD are based on a more relaxed notion of equality --
kind of like non-deterministic collations. These other forms might make
sense in certain cases, but not general use.

I believe that having a kind of text data type where it's stored in NFC and
compared with memcmp() would be a good place for many users to be --
probably most users. It's got all the performance and stability benefits of
memcmp(), with slightly richer semantics. It's less likely that someone
malicious can confuse the database by using different representations of
the same character.

The problem is that it's not universally better for everyone: there are
certainly users who would prefer that the codepoints they send to the
database are preserved exactly, and also users who would like to be able to
use unassigned code points.

Regards,
	Jeff Davis
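The "reject rather than rewrite" half of that is already expressible with
the IS NORMALIZED predicate, for example via a domain; a minimal sketch,
assuming a UTF-8 database:

  -- Enforces NFC on input instead of normalizing it; comparisons of such
  -- values under a deterministic collation then behave as described above.
  CREATE DOMAIN text_nfc AS text
      CHECK (VALUE IS NFC NORMALIZED);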
On Wed, 2023-10-04 at 14:02 -0400, Chapman Flack wrote:
> The SQL standard would have me able to:
>
>   CREATE TABLE foo (
>     a CHARACTER VARYING CHARACTER SET UTF8,
>     b CHARACTER VARYING CHARACTER SET LATIN1
>   )
>
> and so on, and write character literals like
>
>   _UTF8'Hello, world!' and _LATIN1'Hello, world!'

Is there a use case for that? UTF-8 is able to encode any unicode code
point, it's relatively compact, and it's backwards-compatible with 7-bit
ASCII. If you have a variety of text data in your system (and in many cases
even if not), then UTF-8 seems like the right solution.

Text data encoded 17 different ways requires a lot of bookkeeping in the
type system, and it also requires injecting a bunch of fallible transcoding
operators around just to compare strings.

Regards,
	Jeff Davis
On Wed, Oct 04, 2023 at 01:38:15PM -0700, Jeff Davis wrote:
> On Wed, 2023-10-04 at 14:02 -0400, Chapman Flack wrote:
> > The SQL standard would have me able to:
> >
> > [...]
> >
> > _UTF8'Hello, world!' and _LATIN1'Hello, world!'
>
> Is there a use case for that? UTF-8 is able to encode any unicode code
> point, it's relatively compact, and it's backwards-compatible with
> 7-bit ASCII. If you have a variety of text data in your system (and in
> many cases even if not), then UTF-8 seems like the right solution.
>
> Text data encoded 17 different ways requires a lot of bookkeeping in
> the type system, and it also requires injecting a bunch of fallible
> transcoding operators around just to compare strings.

Better that than TEXT blobs w/ the encoding given by the `CREATE DATABASE`
or `initdb` default!

It'd be a lot _less_ fragile to have all text tagged with an encoding
(indirectly, via its type which then denotes the encoding).

That would be a lot of work, but starting with just a UTF-8 text type would
be an improvement.

Nico
--
On 2023-10-04 16:38, Jeff Davis wrote:
> On Wed, 2023-10-04 at 14:02 -0400, Chapman Flack wrote:
>> The SQL standard would have me able to:
>>
>>   CREATE TABLE foo (
>>     a CHARACTER VARYING CHARACTER SET UTF8,
>>     b CHARACTER VARYING CHARACTER SET LATIN1
>>   )
>>
>> and so on
>
> Is there a use case for that? UTF-8 is able to encode any unicode code
> point, it's relatively compact, and it's backwards-compatible with
> 7-bit ASCII. If you have a variety of text data in your system (and in
> many cases even if not), then UTF-8 seems like the right solution.

Well, for what reason does anybody run PG now with the encoding set to
anything besides UTF-8? I don't really have my finger on that pulse.

Could it be that it bloats common strings in their local script, and with
enough of those to store, it could matter to use the local encoding that
stores them more economically?

Also, while any Unicode transfer format can encode any Unicode code point,
I'm unsure whether it's yet the case that {any Unicode code point} is a
superset of every character repertoire associated with every non-Unicode
encoding.

The cheap glaring counterexample is SQL_ASCII. Half those code points are
*nobody knows what Unicode character* (or even *whether*). I'm not
insisting that's a good thing, but it is a thing.

It might be a very tidy future to say all text is Unicode and all server
encodings are UTF-8, but I'm not sure it wouldn't still be a good step on
the way to be able to store some things in their own encodings.

We have JSON and XML now, two data types that are *formally defined* to
accept any Unicode content, and we hedge and mumble and say (well, as long
as it goes in the server encoding) and that makes me sad. Things like that
should be easy to handle even without declaring UTF-8 as a server-wide
encoding ... they already are their own distinct data types, and could
conceivably know their own encodings.

But there again, it's possible that going with unconditional UTF-8 for JSON
or XML documents could, in some regions, bloat them.

Regards,
-Chap
On Wed, 2023-10-04 at 14:14 -0400, Isaac Morland wrote:
> Always store only UTF-8 in the database

What problem does that solve? I don't see our encoding support as a big
source of problems, given that database-wide UTF-8 already works fine. In
fact, some postgres features only work with UTF-8.

I agree that we shouldn't add a bunch of bookkeeping and type system
support for per-column encodings without a clear use case, because that
would have a cost. But right now it's just a database-wide thing.

I don't see encodings as a major area to solve problems or innovate. At the
end of the day, encodings have little semantic significance, and therefore
limited upside and limited downside. Collations and normalization get more
interesting, but those are happening at a higher layer than the encoding.

> What about characters not in UTF-8?

Honestly I'm not clear on this topic. Are the "private use" areas in
unicode enough to cover use cases for characters not recognized by unicode?
Which encodings in postgres can represent characters that can't be
automatically transcoded (without failure) to unicode?

Obviously if we have some kind of unicode-based type, it would only work
with encodings that are a subset of unicode.

Regards,
	Jeff Davis
On Wed, Oct 04, 2023 at 05:32:50PM -0400, Chapman Flack wrote:
> Well, for what reason does anybody run PG now with the encoding set
> to anything besides UTF-8? I don't really have my finger on that pulse.

Because they still have databases that didn't use UTF-8 10 or 20 years ago
that they haven't migrated to UTF-8?

It's harder to think of why one might _want_ to store text in any encoding
other than UTF-8 for _new_ databases. Though too there's no reason that it
should be impossible other than lack of developer interest: as long as text
is tagged with its encoding, it should be possible to store text in any
number of encodings.

> Could it be that it bloats common strings in their local script, and
> with enough of those to store, it could matter to use the local
> encoding that stores them more economically?

UTF-8 bloat is not likely worth the trouble. UTF-8 is only clearly bloaty
when compared to encodings with 1-byte code units, like ISO-8859-*. For
CJK, UTF-8 is not much more bloaty than native non-Unicode encodings like
SHIFT_JIS. UTF-8 is not much bloatier than UTF-16 in general either.

Bloat is not really a good reason to avoid Unicode or any specific TF.

> Also, while any Unicode transfer format can encode any Unicode code
> point, I'm unsure whether it's yet the case that {any Unicode code
> point} is a superset of every character repertoire associated with
> every non-Unicode encoding.

It's not always been the case that Unicode is a strict superset of all
currently-in-use human scripts. Making Unicode a strict superset of all
currently-in-use human scripts seems to be the Unicode Consortium's aim.

I think you're asking why not just use UTF-8 for everything, all the time.
It's a fair question. I don't have a reason to answer in the negative
(maybe someone else does). But that doesn't mean that one couldn't want to
store text in many encodings (e.g., for historical reasons).

Nico
--
On Wed, 2023-10-04 at 16:15 -0500, Nico Williams wrote:
> Better that than TEXT blobs w/ the encoding given by the `CREATE
> DATABASE` or `initdb` default!

From an engineering perspective, yes, per-column encodings would be more
flexible. But I still don't understand who exactly would use that, and why.

It would take an awful lot of effort to implement and make the code more
complex, so we'd really need to see some serious demand for that.

Regards,
	Jeff Davis
On Wed, Oct 04, 2023 at 04:01:26PM -0700, Jeff Davis wrote:
> On Wed, 2023-10-04 at 16:15 -0500, Nico Williams wrote:
> > Better that than TEXT blobs w/ the encoding given by the `CREATE
> > DATABASE` or `initdb` default!
>
> From an engineering perspective, yes, per-column encodings would be
> more flexible. But I still don't understand who exactly would use that,
> and why.

Say you have a bunch of text files in different encodings for reasons
(historical). And now say you want to store them in a database so you can
index them and search them. Sure, you could use a filesystem, but you want
an RDBMS. Well, the answer to this is "convert all those files to UTF-8".

> It would take an awful lot of effort to implement and make the code
> more complex, so we'd really need to see some serious demand for that.

Yes, it's better to just use UTF-8.

The DB could implement conversions to/from other codesets and encodings for
clients that insist on it. Why would clients insist anyways? Better to do
the conversions at the clients.

In the middle it's best to just have Unicode, and specifically UTF-8, then
push all conversions to the edges of the system.

Nico
--
On Wed, 4 Oct 2023 at 17:37, Jeff Davis <pgsql@j-davis.com> wrote:
> On Wed, 2023-10-04 at 14:14 -0400, Isaac Morland wrote:
> > Always store only UTF-8 in the database
>
> What problem does that solve? I don't see our encoding support as a big
> source of problems, given that database-wide UTF-8 already works fine.
> In fact, some postgres features only work with UTF-8.
My idea is in the context of a suggestion that we support specifying the encoding per column. I don't mean to suggest eliminating the ability to set a server-wide encoding, although I doubt there is any use case for using anything other than UTF-8 except for an old database that hasn’t been converted yet.
I see no reason to write different strings using different encodings in the data files, depending on what column they belong to. The various text types are all abstract data types which store sequences of characters (not bytes); if one wants bytes, then one has to encode them. Of course, if one wants UTF-8 bytes, then the encoding is, under the covers, the identity function, but conceptually it is still taking the characters stored in the database and converting them to bytes according to a specific encoding.
By contrast, although I don’t see it as a top-priority use case, I can imagine somebody wanting to restrict the characters stored in a particular column to characters that can be encoded in a particular encoding. That is what "CHARACTER SET LATIN1" and so on should mean.
> > What about characters not in UTF-8?
>
> Honestly I'm not clear on this topic. Are the "private use" areas in
> unicode enough to cover use cases for characters not recognized by
> unicode? Which encodings in postgres can represent characters that
> can't be automatically transcoded (without failure) to unicode?
Here I’m just anticipating a hypothetical objection, “what about characters that can’t be represented in UTF-8?” to my suggestion to always use UTF-8 and I’m saying we shouldn’t care about them. I believe the answers to your questions in this paragraph are “yes”, and “none”.
On Wed, Oct 4, 2023 at 9:02 PM Isaac Morland <isaac.morland@gmail.com> wrote:
>> > What about characters not in UTF-8?
>>
>> Honestly I'm not clear on this topic. Are the "private use" areas in
>> unicode enough to cover use cases for characters not recognized by
>> unicode? Which encodings in postgres can represent characters that
>> can't be automatically transcoded (without failure) to unicode?
>
> Here I’m just anticipating a hypothetical objection, “what about
> characters that can’t be represented in UTF-8?” to my suggestion to
> always use UTF-8 and I’m saying we shouldn’t care about them. I believe
> the answers to your questions in this paragraph are “yes”, and “none”.

Years ago, I remember SJIS being cited as an example of an encoding that
had characters which weren't part of Unicode. I don't know whether this is
still a live issue.

But I do think that sometimes users are reluctant to perform encoding
conversions on the data that they have. Sometimes they're not completely
certain what encoding their data is in, and sometimes they're worried that
the encoding conversion might fail or produce wrong answers. In theory, if
your existing data is validly encoded and you know what encoding it's in
and it's easily mapped onto UTF-8, there's no problem. You can just
transcode it and be done. But a lot of times the reality is a lot messier
than that.

Which gives me some sympathy with the idea of wanting multiple character
sets within a database. Such a feature exists in some other database
systems and is, presumably, useful to some people. On the other hand, to do
that in PostgreSQL, we'd need to propagate the character set/encoding
information into all of the places that currently get the typmod and
collation, and that is not a small number of places. It's a lot of
infrastructure for the project to carry around for a feature that's
probably only going to continue to become less relevant.

I suppose you never know, though. Maybe the Unicode consortium will explode
in a tornado of fiery rage and there will be dueling standards making war
over the proper way of representing an emoji of a dog eating broccoli for
decades to come. In that case, our hypothetical multi-character-set feature
might seem prescient.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, 5 Oct 2023 at 07:32, Robert Haas <robertmhaas@gmail.com> wrote:
> But I do think that sometimes users are reluctant to perform encoding
> conversions on the data that they have. Sometimes they're not
> completely certain what encoding their data is in, and sometimes
> they're worried that the encoding conversion might fail or produce
> wrong answers. In theory, if your existing data is validly encoded and
> you know what encoding it's in and it's easily mapped onto UTF-8,
> there's no problem. You can just transcode it and be done. But a lot
> of times the reality is a lot messier than that.
In the case you describe, the users don’t have text at all; they have bytes, and a vague belief about what encoding the bytes might be in and therefore what characters they are intended to represent. The correct way to store that in the database is using bytea. Text types should be for when you know what characters you want to store. In this scenario, the implementation detail of what encoding the database uses internally to write the data on the disk doesn't matter, any more than it matters to a casual user how a table is stored on disk.
Similarly, I don't believe we have a "YMD" data type which stores year, month, and day, without being specific as to whether it's Gregorian or Julian; if you have that situation, make a 3-tuple type or do something else. "Date" is for when you actually know what day you want to record.
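A minimal sketch of that bytea approach with existing functions, decoding
only once the encoding is actually known (table and column names are
illustrative):

  -- Keep the raw bytes plus whatever is known about their encoding,
  -- and decode to text explicitly when querying.
  CREATE TABLE raw_docs (
      id       integer PRIMARY KEY,
      body     bytea,
      enc_hint text          -- e.g. 'LATIN1'; possibly unreliable
  );

  -- convert_from() raises an error if the bytes are not valid in the stated encoding.
  SELECT convert_from(body, 'LATIN1') AS body_text
  FROM raw_docs
  WHERE id = 1;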
On Thu, 2023-10-05 at 07:31 -0400, Robert Haas wrote:
> It's a lot of infrastructure for the project to carry
> around for a feature that's probably only going to continue to become
> less relevant.

Agreed, at least until we understand the set of users per-column encoding
is important to. I acknowledge that the presence of per-column encoding in
the standard is some kind of signal there, but not enough by itself to
justify something so invasive.

> I suppose you never know, though.

On balance I think it's better to keep the code clean enough that we can
adapt to whatever unanticipated things happen in the future; rather than to
make the code very complicated trying to anticipate everything, and then
being completely unable to adapt it when something unanticipated happens
anyway.

Regards,
	Jeff Davis
On Thu, Oct 05, 2023 at 07:31:54AM -0400, Robert Haas wrote:
> [...] On the other hand, to do that in PostgreSQL, we'd need to
> propagate the character set/encoding information into all of the
> places that currently get the typmod and collation, and that is not a
> small number of places. It's a lot of infrastructure for the project
> to carry around for a feature that's probably only going to continue
> to become less relevant.

Text+encoding can be just like bytea with a one- or two-byte prefix
indicating what codeset+encoding it's in. That'd be how to encode such text
values on the wire, though on disk the column's type should indicate the
codeset+encoding, so no need to add a prefix to the value.

Complexity would creep in around when and whether to perform automatic
conversions. The easy answer would be "never, on the server side", but on
the client side it might be useful to convert to/from the locale's
codeset+encoding when displaying to the user or accepting user input.

If there's no automatic server-side codeset/encoding conversions then the
server-side cost of supporting non-UTF-8 text should not be too high
dev-wise -- it's just (famous last words) a generic text type parameterized
by codeset+encoding type. There would not even be a hard need for functions
for conversions, though there would be demand for them.

But I agree that if there's no need, there's no need. UTF-8 is great, and
if only all PG users would just switch then there's not much more to do.

Nico
--
On Thu, 2023-10-05 at 09:10 -0400, Isaac Morland wrote:
> In the case you describe, the users don't have text at all; they have
> bytes, and a vague belief about what encoding the bytes might be in
> and therefore what characters they are intended to represent. The
> correct way to store that in the database is using bytea.

I wouldn't be so absolute. It's text data to the user, and is presumably
working fine for them now, and if they switched to bytea today then 'foo'
would show up as '\x666f6f' in psql.

The point is that this is a somewhat messy problem because there's so much
software out there that treats byte strings and textual data
interchangeably. Rust goes the extra mile to organize all of this, and it
ends up with:

* String -- always UTF-8, never NUL-terminated
* CString -- NUL-terminated byte sequence with no internal NULs
* OsString[3] -- needed to make a Path[4], which is needed to open a file[5]
* Vec<u8> -- any byte sequence

and I suppose we could work towards offering better support for these
different types, the casts between them, and delivering them in a form the
client can understand. But I wouldn't describe it as a solved problem with
one "correct" solution.

One takeaway from this discussion is that it would be useful to provide
more flexibility in how values are represented to the client in a more
general way. In addition to encoding, representational issues have come up
with binary formats, bytea, extra_float_digits, etc.

The collection of books by CJ Date & Hugh Darwen, et al. (sorry I don't
remember exactly which books), made the theoretical case for explicitly
distinguishing values from representations at the language level. We're
starting to see that representational issues can't be satisfied with a few
special cases and hacks -- it's worth thinking about a general solution to
that problem. There was also a lot of relevant discussion about how to
think about overlapping domains (e.g. ASCII is valid in any of these text
domains).

> Text types should be for when you know what characters you want to
> store. In this scenario, the implementation detail of what encoding
> the database uses internally to write the data on the disk doesn't
> matter, any more than it matters to a casual user how a table is
> stored on disk.

Perhaps the user and application do know, and there's some kind of subtlety
that we're missing, or some historical artefact that we're not accounting
for, and that somehow makes UTF-8 unsuitable. Surely there are applications
that treat certain byte sequences in non-standard ways, and perhaps not all
of those byte sequences can be reproduced by transcoding from UTF-8 to the
client_encoding. In any case, I would want to understand in detail why a
user thinks UTF8 is not good enough before I make too strong of a statement
here.

Even the terminal font that I use renders some "identical" unicode
characters slightly differently depending on the code points from which
they are composed. I believe that's an intentional convenience to make it
more apparent why the "diff" command (or other byte-based tool) is showing
a difference between two textually identical strings, but it's also a
violation of unicode. (This is another reason why normalization might not
be for everyone, but I believe it's still good in typical cases.)

Regards,
	Jeff Davis
Nico Williams <nico@cryptonector.com> writes:
> Text+encoding can be just like bytea with a one- or two-byte prefix
> indicating what codeset+encoding it's in. That'd be how to encode
> such text values on the wire, though on disk the column's type should
> indicate the codeset+encoding, so no need to add a prefix to the value.

The precedent of BOMs (byte order marks) suggests strongly that such a
solution would be horrible to use.

			regards, tom lane
On Thu, Oct 05, 2023 at 03:49:37PM -0400, Tom Lane wrote:
> Nico Williams <nico@cryptonector.com> writes:
> > Text+encoding can be just like bytea with a one- or two-byte prefix
> > indicating what codeset+encoding it's in. That'd be how to encode
> > such text values on the wire, though on disk the column's type should
> > indicate the codeset+encoding, so no need to add a prefix to the value.
>
> The precedent of BOMs (byte order marks) suggests strongly that
> such a solution would be horrible to use.

This is just how you encode the type of the string. You have any number of
options. The point is that PG can already encode binary data, so if the
question is how to encode text of disparate encodings on the wire, building
on top of the encoding of bytea is an option.
On 03.10.23 21:54, Jeff Davis wrote:
>> Here, Jeff mentions normalization, but I think it's a major issue with
>> collation support. If new code points are added, users can put them
>> into the database before they are known to the collation library, and
>> then when they become known to the collation library the sort order
>> changes and indexes break.
>
> The collation version number may reflect the change in understanding
> about assigned code points that may affect collation -- though I'd like
> to understand whether this is guaranteed or not.

This is correct.  The collation version number produced by ICU contains the
UCA version, which is effectively the Unicode version (14.0, 15.0, etc.).
Since new code point assignments can only come from new Unicode versions, a
new assigned code point will always result in a different collation
version.

For example, with ICU 70 / CLDR 40 / Unicode 14:

  select collversion from pg_collation where collname = 'unicode';
  = 153.112

With ICU 72 / CLDR 42 / Unicode 15:

  = 153.120

> At minimum I think we need to have some internal functions to check for
> unassigned code points. That belongs in core, because we generate the
> unicode tables from a specific version.

If you want to be rigid about it, you also need to consider whether the
Unicode version used by the ICU library in use matches the one used by the
in-core tables.
On 05.10.23 19:30, Jeff Davis wrote:
> Agreed, at least until we understand the set of users per-column
> encoding is important to. I acknowledge that the presence of per-column
> encoding in the standard is some kind of signal there, but not enough
> by itself to justify something so invasive.

The per-column encoding support in SQL is clearly a legacy feature from
before Unicode. If one were to write something like SQL today, one would
most likely just specify, "everything is Unicode".
On Fri, 2023-10-06 at 09:58 +0200, Peter Eisentraut wrote:
> If you want to be rigid about it, you also need to consider whether the
> Unicode version used by the ICU library in use matches the one used by
> the in-core tables.

What problem are you concerned about here? I thought about it and I didn't
see an obvious issue.

If the ICU unicode version is ahead of the Postgres unicode version, and no
unassigned code points are used according to the Postgres version, then
there's no problem.

And in the other direction, there might be some code points that are
assigned according to the postgres unicode version but unassigned according
to the ICU version. But that would be tracked by the collation version as
you pointed out earlier, so upgrading ICU would be like any other ICU
upgrade (with the same risks). Right?

Regards,
	Jeff Davis
On Thu, Oct 5, 2023 at 3:15 PM Nico Williams <nico@cryptonector.com> wrote:
> Text+encoding can be just like bytea with a one- or two-byte prefix
> indicating what codeset+encoding it's in. That'd be how to encode
> such text values on the wire, though on disk the column's type should
> indicate the codeset+encoding, so no need to add a prefix to the value.

Well, that would be making the encoding a per-value property, rather than a
per-column property like collation as I proposed. I can't see that working
out very nicely, because encodings are collation-specific. It wouldn't make
any sense if the column collation were en_US.UTF8 or ko_KR.eucKR or
en_CA.ISO8859-1 (just to pick a few values that are legal on my machine)
while data stored in the column was from a whole bunch of different
encodings, at most one of which could be the one to which the column's
collation applied. That would end up meaning, for example, that such a
column was very hard to sort.

For that and other reasons, I suspect that the utility of storing data from
a variety of different encodings in the same database column is quite
limited. What I think people really want is a whole column in some encoding
that isn't the normal one for that database. That's not to say we should
add such a feature, but if we do, I think it should be that, not a
different encoding for every individual value.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Oct 06, 2023 at 01:33:06PM -0400, Robert Haas wrote: > On Thu, Oct 5, 2023 at 3:15 PM Nico Williams <nico@cryptonector.com> wrote: > > Text+encoding can be just like bytea with a one- or two-byte prefix > > indicating what codeset+encoding it's in. That'd be how to encode > > such text values on the wire, though on disk the column's type should > > indicate the codeset+encoding, so no need to add a prefix to the value. > > Well, that would be making the encoding a per-value property, rather > than a per-column property like collation as I proposed. I can't see On-disk it would be just a property of the type, not part of the value. Nico --
On Thu, 2023-10-05 at 14:52 -0500, Nico Williams wrote: > This is just how you encode the type of the string. You have any > number > of options. The point is that already PG can encode binary data, so > if > how to encode text of disparate encodings on the wire, building on > top > of the encoding of bytea is an option. There's another significant discussion going on here: https://www.postgresql.org/message-id/CA+TgmoZ8r8xb_73WzKHGb00cV3tpHV_U0RHuzzMFKvLepdu2Jw@mail.gmail.com about how to handle binary formats better, so it's not clear to me that it's a great precedent to expand upon. At least not yet. I think it would be interesting to think more generally about these representational issues in a way that accounts for binary formats, extra_float_digits, client_encoding, etc. But I see that as more of an issue with how the client expects to receive the data -- nobody has presented a reason in this thread that we need per-column encodings on the server. Regards, Jeff Davis
On Fri, Oct 6, 2023 at 1:38 PM Nico Williams <nico@cryptonector.com> wrote: > On Fri, Oct 06, 2023 at 01:33:06PM -0400, Robert Haas wrote: > > On Thu, Oct 5, 2023 at 3:15 PM Nico Williams <nico@cryptonector.com> wrote: > > > Text+encoding can be just like bytea with a one- or two-byte prefix > > > indicating what codeset+encoding it's in. That'd be how to encode > > > such text values on the wire, though on disk the column's type should > > > indicate the codeset+encoding, so no need to add a prefix to the value. > > > > Well, that would be making the encoding a per-value property, rather > > than a per-column property like collation as I proposed. I can't see > > On-disk it would be just a property of the type, not part of the value. I mean, that's not how it works. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Oct 06, 2023 at 02:17:32PM -0400, Robert Haas wrote: > On Fri, Oct 6, 2023 at 1:38 PM Nico Williams <nico@cryptonector.com> wrote: > > On Fri, Oct 06, 2023 at 01:33:06PM -0400, Robert Haas wrote: > > > On Thu, Oct 5, 2023 at 3:15 PM Nico Williams <nico@cryptonector.com> wrote: > > > > Text+encoding can be just like bytea with a one- or two-byte prefix > > > > indicating what codeset+encoding it's in. That'd be how to encode > > > > such text values on the wire, though on disk the column's type should > > > > indicate the codeset+encoding, so no need to add a prefix to the value. > > > > > > Well, that would be making the encoding a per-value property, rather > > > than a per-column property like collation as I proposed. I can't see > > > > On-disk it would be just a property of the type, not part of the value. > > I mean, that's not how it works. Sure, because TEXT in PG doesn't have codeset+encoding as part of it -- it's whatever the database's encoding is. Collation can and should be a property of a column, since for Unicode it wouldn't be reasonable to make that part of the type. But codeset+encoding should really be a property of the type if PG were to support more than one. IMO.
On Fri, Oct 6, 2023 at 2:25 PM Nico Williams <nico@cryptonector.com> wrote: > > > > Well, that would be making the encoding a per-value property, rather > > > > than a per-column property like collation as I proposed. I can't see > > > > > > On-disk it would be just a property of the type, not part of the value. > > > > I mean, that's not how it works. > > Sure, because TEXT in PG doesn't have codeset+encoding as part of it -- > it's whatever the database's encoding is. Collation can and should be a > property of a column, since for Unicode it wouldn't be reasonable to > make that part of the type. But codeset+encoding should really be a > property of the type if PG were to support more than one. IMO. No, what I mean is, you can't just be like "oh, the varlena will be different in memory than on disk" as if that were no big deal. I agree that, as an alternative to encoding being a column property, it could instead be completely a type property, meaning that if you want to store, say, LATIN1 text in your UTF-8 database, you first create a latin1text data type and then use it, rather than, as in the model I proposed, creating a text column and then applying a setting like ENCODING latin1 to it. I think that there might be some problems with that model, but it could also have some benefits. If someone were going to make a run at implementing this, they might want to consider both designs and evaluate the tradeoffs. But, even if we were all convinced that this kind of feature was good to add, I think it would almost certainly be wrong to invent new varlena features along the way. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, 2023-10-06 at 13:33 -0400, Robert Haas wrote: > What I think people really want is a whole column in > some encoding that isn't the normal one for that database. Do people really want that? I'd be curious to know why. A lot of modern projects are simply declaring UTF-8 to be the "one true way". I am not suggesting that we do that, but it seems odd to go in the opposite direction and have greater flexibility for many encodings. Regards, Jeff Davis
On Fri, 6 Oct 2023 at 15:07, Jeff Davis <pgsql@j-davis.com> wrote:
On Fri, 2023-10-06 at 13:33 -0400, Robert Haas wrote:
> What I think people really want is a whole column in
> some encoding that isn't the normal one for that database.
Do people really want that? I'd be curious to know why.
A lot of modern projects are simply declaring UTF-8 to be the "one true
way". I am not suggesting that we do that, but it seems odd to go in
the opposite direction and have greater flexibility for many encodings.
And even if they want it, we can give it to them when we send/accept the data from the client; just because they want to store ISO-8859-1 doesn't mean the actual bytes on the disk need to be that. And by "client" maybe I mean the client end of the network connection, and maybe I mean the program that is calling in to libpq.
If they try to submit data that cannot possibly be encoded in the stated encoding because the bytes they submit don't correspond to any string in that encoding, then that is unambiguously an error, just as trying to put February 30 in a date column is an error.
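In concrete SQL terms, both of these are already rejected today, and for essentially the same reason: a value that cannot exist in the declared type or encoding:

    SELECT convert_from('\xc3'::bytea, 'UTF8');  -- ERROR: invalid byte sequence for encoding "UTF8"
    SELECT date '2023-02-30';                    -- ERROR: date/time field value out of range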
Is there a single other data type where anybody is even discussing letting the client tell us how to write the data on disk?
On Fri, 6 Oct 2023, 21:08 Jeff Davis, <pgsql@j-davis.com> wrote:
On Fri, 2023-10-06 at 13:33 -0400, Robert Haas wrote:
> What I think people really want is a whole column in
> some encoding that isn't the normal one for that database.
Do people really want that? I'd be curious to know why.
One reason someone would want this is that a database cluster may have been initialized with something like --no-locale (thus getting defaulted to LC_COLLATE=C, which is desired behaviour and gets fast strcmp operations for indexing, and to the SQL_ASCII encoding, which is not exactly expected but can be sufficient for some workloads), but now that the data has grown they want to use en_US.UTF-8 collations in some of their newer tables' fields.
Or, a user wants to maintain literal translation tables, where different encodings would be needed for different languages whose scripts Unicode does not yet fully cover.
Additionally, I'd imagine specialized encodings like Shift_JIS could be more space-efficient than UTF-8 for e.g. Japanese text, which might be useful for someone who wants to be a bit more frugal with storage when they know the text is guaranteed to be in some encoding's native language: compression can do the same work, but also adds significant overhead.
I've certainly experienced situations where I forgot to explicitly include the encoding in initdb --no-locale and then only much later noticed that my big data load is useless due to an inability to create UTF-8 collated indexes.
I often use --no-locale to make string indexing fast (locales/collation are not often important to my workload) and to block any environment variables from being carried over into the installation. An ability to set or update the encoding of columns would help reduce the pain: I would no longer have to re-initialize the database or cluster from 0.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
On Wed, 2023-10-04 at 13:16 -0400, Robert Haas wrote: > > At minimum I think we need to have some internal functions to check > > for > > unassigned code points. That belongs in core, because we generate > > the > > unicode tables from a specific version. > > That's a good idea. Patch attached. I added a new perl script to parse UnicodeData.txt and generate a lookup table (of ranges, which can be binary-searched). The C entry point does the same thing as u_charType(), and I also matched the enum numeric values for convenience. I didn't use u_charType() because I don't think this kind of unicode functionality should depend on ICU, and I think it should match other Postgres Unicode functionality. Strictly speaking, I only needed to know whether it's unassigned or not, not the general category. But it seemed easy enough to return the general category, and it will be easier to create other potentially-useful functions on top of this. The tests do require ICU though, because I compare with the results of u_charType(). Regards, Jeff Davis
Attachment
On Fri, Oct 6, 2023 at 3:07 PM Jeff Davis <pgsql@j-davis.com> wrote: > On Fri, 2023-10-06 at 13:33 -0400, Robert Haas wrote: > > What I think people really want is a whole column in > > some encoding that isn't the normal one for that database. > > Do people really want that? I'd be curious to know why. Because it's a feature that exists in other products and so having it eases migrations and/or replication of data between systems. I'm not saying that there are a lot of people who want this, any more. I think there used to be more interest in it. But the point of the comment was that people who want multiple character set support want it as a per-column property, not a per-value property. I've never heard of anyone wanting to store text blobs in multiple distinct character sets in the same column. But I have heard of people wanting text blobs in multiple distinct character sets in the same database, each one in its own column. -- Robert Haas EDB: http://www.enterprisedb.com
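For contrast, collation is already a per-column property today, so the per-column shape people have in mind for encodings would presumably look much like the following sketch (illustrative table and collation names, assuming an ICU-enabled build); only the encoding part is missing:

    CREATE COLLATION german (provider = icu, locale = 'de-DE');
    CREATE TABLE messages (
        body_de text COLLATE german,
        body_en text COLLATE "en-x-icu"
    );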
On 07.10.23 03:18, Jeff Davis wrote: > On Wed, 2023-10-04 at 13:16 -0400, Robert Haas wrote: >>> At minimum I think we need to have some internal functions to check >>> for >>> unassigned code points. That belongs in core, because we generate >>> the >>> unicode tables from a specific version. >> That's a good idea. > Patch attached. Can you restate what this is supposed to be for? This thread appears to have morphed from "let's normalize everything" to "let's check for unassigned code points", but I'm not sure what we are aiming for now.
On 06.10.23 19:22, Jeff Davis wrote: > On Fri, 2023-10-06 at 09:58 +0200, Peter Eisentraut wrote: >> If you want to be rigid about it, you also need to consider whether >> the >> Unicode version used by the ICU library in use matches the one used >> by >> the in-core tables. > What problem are you concerned about here? I thought about it and I > didn't see an obvious issue. > > If the ICU unicode version is ahead of the Postgres unicode version, > and no unassigned code points are used according to the Postgres > version, then there's no problem. > > And in the other direction, there might be some code points that are > assigned according to the postgres unicode version but unassigned > according to the ICU version. But that would be tracked by the > collation version as you pointed out earlier, so upgrading ICU would be > like any other ICU upgrade (with the same risks). Right? It might be alright in this particular combination of circumstances. But in general if we rely on these tables for correctness (e.g., check that a string is normalized before passing it to a function that requires it to be normalized), we would need to consider this. The correct fix would then probably be to not use our own tables but use some ICU function to achieve the desired task.
On Tue, Oct 10, 2023 at 2:44 AM Peter Eisentraut <peter@eisentraut.org> wrote: > Can you restate what this is supposed to be for? This thread appears to > have morphed from "let's normalize everything" to "let's check for > unassigned code points", but I'm not sure what we are aiming for now. Jeff can say what he wants it for, but one obvious application would be to have the ability to add a CHECK constraint that forbids inserting unassigned code points into your database, which would be useful if you're worried about forward-compatibility with collation definitions that might be extended to cover those code points in the future. Another application would be to find data already in your database that has this potential problem. -- Robert Haas EDB: http://www.enterprisedb.com
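A sketch of that first application, using the unicode_assigned() name the patch settles on later in this thread (table and column names are hypothetical):

    -- Reject unassigned code points at insert time:
    CREATE TABLE docs (
        body text CHECK (unicode_assigned(body))
    );
    -- Audit existing data for the same problem:
    SELECT ctid FROM docs WHERE NOT unicode_assigned(body);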
On Tue, 2023-10-10 at 10:02 -0400, Robert Haas wrote: > On Tue, Oct 10, 2023 at 2:44 AM Peter Eisentraut > <peter@eisentraut.org> wrote: > > Can you restate what this is supposed to be for? This thread > > appears to > > have morphed from "let's normalize everything" to "let's check for > > unassigned code points", but I'm not sure what we are aiming for > > now. It was a "pre-proposal", so yes, the goalposts have moved a bit. Right now I'm aiming to get some primitives in place that will be useful by themselves, but also that we can potentially build on. Attached is a new version of the patch which introduces some SQL functions as well: * unicode_is_valid(text): returns true if all codepoints are assigned, false otherwise * unicode_version(): version of unicode Postgres is built with * icu_unicode_version(): version of Unicode ICU is built with I'm not 100% clear on the consequences of differences between the PG unicode version and the ICU unicode version, but because normalization uses the Postgres version of Unicode, I believe the Postgres version of Unicode should also be available to determine whether a code point is assigned or not. We may also find it interesting to use the PG Unicode tables for regex character classification. This is just an idea and we can discuss whether that makes sense or not, but having the primitives in place seems like a good idea regardless. > Jeff can say what he wants it for, but one obvious application would > be to have the ability to add a CHECK constraint that forbids > inserting unassigned code points into your database, which would be > useful if you're worried about forward-compatibility with collation > definitions that might be extended to cover those code points in the > future. Another application would be to find data already in your > database that has this potential problem. Exactly. Avoiding unassigned code points also allows you to be forward- compatible with normalization. Regards, Jeff Davis
Attachment
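A minimal sketch of those functions in use, assuming the attached patch is applied (unicode_is_valid() is renamed later in the thread):

    SELECT unicode_version();                 -- Unicode version Postgres was built with
    SELECT icu_unicode_version();             -- Unicode version of the linked ICU, if any
    SELECT unicode_is_valid(U&'\0061\0301');  -- true: both code points are assigned
    SELECT unicode_is_valid(U&'\0378');       -- false: U+0378 is currently unassigned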
On 10.10.23 16:02, Robert Haas wrote: > On Tue, Oct 10, 2023 at 2:44 AM Peter Eisentraut <peter@eisentraut.org> wrote: >> Can you restate what this is supposed to be for? This thread appears to >> have morphed from "let's normalize everything" to "let's check for >> unassigned code points", but I'm not sure what we are aiming for now. > > Jeff can say what he wants it for, but one obvious application would > be to have the ability to add a CHECK constraint that forbids > inserting unassigned code points into your database, which would be > useful if you're worried about forward-compatibility with collation > definitions that might be extended to cover those code points in the > future. I don't see how this would really work in practice. Whether your data has unassigned code points or not, when the collations are updated to the next Unicode version, the collations will have a new version number, and so you need to run the refresh procedure in any case.
On 11.10.23 03:08, Jeff Davis wrote: > * unicode_is_valid(text): returns true if all codepoints are > assigned, false otherwise We need to be careful about precise terminology. "Valid" has a defined meaning for Unicode. A byte sequence can be valid or not as UTF-8. But a string containing unassigned code points is not not-"valid" as Unicode. > * unicode_version(): version of unicode Postgres is built with > * icu_unicode_version(): version of Unicode ICU is built with This seems easy enough, but it's not clear what users would actually do with that.
On Wed, 2023-10-11 at 08:56 +0200, Peter Eisentraut wrote: > On 11.10.23 03:08, Jeff Davis wrote: > > * unicode_is_valid(text): returns true if all codepoints are > > assigned, false otherwise > > We need to be careful about precise terminology. "Valid" has a > defined > meaning for Unicode. A byte sequence can be valid or not as UTF-8. > But > a string containing unassigned code points is not not-"valid" as > Unicode. Agreed. Perhaps "unicode_assigned()" is better? > > * unicode_version(): version of unicode Postgres is built with > > * icu_unicode_version(): version of Unicode ICU is built with > > This seems easy enough, but it's not clear what users would actually > do > with that. Just there to make it visible. If it affects the semantics (which it does currently for normalization) it seems wise to have some way to access the version. Regards, Jeff Davis
On Wed, 2023-10-11 at 08:51 +0200, Peter Eisentraut wrote: > I don't see how this would really work in practice. Whether your > data > has unassigned code points or not, when the collations are updated to > the next Unicode version, the collations will have a new version > number, > and so you need to run the refresh procedure in any case. Even with a version number, we don't provide a great refresh procedure or document how it should be done. In practice, avoiding unassigned code points might mitigate some kinds of problems, especially for glibc, which has a very coarse version number. In any case, a CHECK constraint to avoid unassigned code points is useful for staying forward-compatible with normalization, and also might just be a good sanity check. Regards, Jeff Davis
On Wed, 2023-10-11 at 08:56 +0200, Peter Eisentraut wrote: > We need to be careful about precise terminology. "Valid" has a > defined > meaning for Unicode. A byte sequence can be valid or not as UTF- > 8. > But > a string containing unassigned code points is not not-"valid" as > Unicode. New patch attached, function name is "unicode_assigned". I believe the patch has utility as-is, but I've been brainstorming a few more ideas that could build on it: * Add a per-database option to enforce only storing assigned Unicode code points. * (More radical) Add a per-database option to normalize all text in NFC. * Do character classification in Unicode rather than relying on glibc/ICU. This would affect regex character classes, etc., but not affect upper/lower/initcap nor collation. I did some experiments and the General Category doesn't change a lot: a total of 197 characters changed their General Category since Unicode 6.0.0, and only 5 since Unicode 11.0.0. I'm not quite sure how to expose this, but it seems like a nicer way to handle it than tying it into the collation provider. Regards, Jeff Davis
Attachment
Jeff Davis wrote: > I believe the patch has utility as-is, but I've been brainstorming a > few more ideas that could build on it: > > * Add a per-database option to enforce only storing assigned unicode > code points. There's a problem in the fact that the set of assigned code points is expanding with every Unicode release, which happens about every year. If we had this option in Postgres 11 released in 2018 it would use Unicode 11, and in 2023 this feature would reject thousands of code points that have been assigned since then. Aside from that, aborting a transaction because there's an unassigned code point in a string feels like doing too much, too late. The programs that want to filter out unwanted code points do it before they hit the database, client-side. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
On Tue, Oct 17, 2023 at 11:07 AM Daniel Verite <daniel@manitou-mail.org> wrote: > There's a problem in the fact that the set of assigned code points is > expanding with every Unicode release, which happens about every year. > > If we had this option in Postgres 11 released in 2018 it would use > Unicode 11, and in 2023 this feature would reject thousands of code > points that have been assigned since then. Are code points assigned from a gapless sequence? That is, is the implementation of codepoint_is_assigned(char) just 'codepoint < SOME_VALUE' and SOME_VALUE increases over time? If so, we could consider having a function that lets you specify the bound as an input parameter. But whether anyone would use it, or know how to set that input parameter, is questionable. The real issue here is whether you can figure out which of the code points that you could put into the database already have collation definitions. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, 17 Oct 2023 at 11:15, Robert Haas <robertmhaas@gmail.com> wrote:
Are code points assigned from a gapless sequence? That is, is the
implementation of codepoint_is_assigned(char) just 'codepoint <
SOME_VALUE' and SOME_VALUE increases over time?
Not even close. Code points are organized in blocks, e.g. for mathematical symbols or Ethiopic script. Sometimes new blocks are added, sometimes new characters are added to existing blocks. Where they go is a combination of convenience, history, and planning.
On Tue, Oct 17, 2023 at 11:38 AM Isaac Morland <isaac.morland@gmail.com> wrote: > On Tue, 17 Oct 2023 at 11:15, Robert Haas <robertmhaas@gmail.com> wrote: >> Are code points assigned from a gapless sequence? That is, is the >> implementation of codepoint_is_assigned(char) just 'codepoint < >> SOME_VALUE' and SOME_VALUE increases over time? > > Not even close. Code points are organized in blocks, e.g. for mathematical symbols or Ethiopic script. Sometimes new blocks are added, sometimes new characters are added to existing blocks. Where they go is a combination of convenience, history, and planning. Ah. Good to know. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, 2023-10-17 at 17:07 +0200, Daniel Verite wrote: > There's a problem in the fact that the set of assigned code points is > expanding with every Unicode release, which happens about every year. > > If we had this option in Postgres 11 released in 2018 it would use > Unicode 11, and in 2023 this feature would reject thousands of code > points that have been assigned since then. That wouldn't be good for everyone, but might it be good for some users? We already expose normalization functions. If users are depending on normalization, and they have unassigned code points in their system, that will break when we update Unicode. By restricting themselves to assigned code points, normalization is guaranteed to be forward- compatible. Regards, Jeff Davis
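That dependency is visible with the existing SQL facilities; both of the following use the Unicode version Postgres was built with, and restricting input to assigned code points is what keeps their answers stable across Unicode updates:

    SELECT U&'\0061\0301' IS NFC NORMALIZED;                  -- false: decomposed form
    SELECT normalize(U&'\0061\0301', NFC) IS NFC NORMALIZED;  -- true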
On Mon, 2023-10-16 at 20:32 -0700, Jeff Davis wrote: > On Wed, 2023-10-11 at 08:56 +0200, Peter Eisentraut wrote: > > We need to be careful about precise terminology. "Valid" has a > > defined > > meaning for Unicode. A byte sequence can be valid or not as UTF- > > 8. > > But > > a string containing unassigned code points is not not-"valid" as > > Unicode. > > New patch attached, function name is "unicode_assigned". I plan to commit something like v3 early next week unless someone else has additional comments or I missed a concern. Regards, Jeff Davis
bowerbird and hammerkop didn't like commit a02b37fc. They're still using the old 3rd build system that is not tested by CI. It's due for removal in the 17 cycle IIUC but in the meantime I guess the new codegen script needs to be invoked by something under src/tools/msvc? varlena.obj : error LNK2019: unresolved external symbol unicode_category referenced in function unicode_assigned [H:\\prog\\bf\\root\\HEAD\\pgsql.build\\postgres.vcxproj]
On Fri, Oct 06, 2023 at 02:37:06PM -0400, Robert Haas wrote: > > Sure, because TEXT in PG doesn't have codeset+encoding as part of it -- > > it's whatever the database's encoding is. Collation can and should be a > > property of a column, since for Unicode it wouldn't be reasonable to > > make that part of the type. But codeset+encoding should really be a > > property of the type if PG were to support more than one. IMO. > > No, what I mean is, you can't just be like "oh, the varlena will be > different in memory than on disk" as if that were no big deal. It would have to be the same in memory as on disk, indeed, but you might need new types in C as well for that. > I agree that, as an alternative to encoding being a column property, > it could instead be completely a type property, meaning that if you > want to store, say, LATIN1 text in your UTF-8 database, you first > create a latin1text data type and then use it, rather than, as in the > model I proposed, creating a text column and then applying a setting > like ENCODING latin1 to it. I think that there might be some problems Yes, that was the idea. > with that model, but it could also have some benefits. [...] Mainly, I think, whether you want PG to do automatic codeset conversions (ugly and problematic) or not, like for when using text functions. Automatic codeset conversions are problematic because a) it can be lossy (so what to do when it is?) and b) automatic type conversions can be surprising. Ultimately the client would have to do its own codeset conversions, if it wants them, or treat text in codesets other than its local one as blobs and leave it for a higher app layer to deal with. I wouldn't want to propose automatic codeset conversions. If you'd want that then you might as well declare it has to all be UTF-8 and say no to any other codesets. > But, even if we were all convinced that this kind of feature was good > to add, I think it would almost certainly be wrong to invent new > varlena features along the way. Yes. Nico --
On Wed, Oct 04, 2023 at 01:16:22PM -0400, Robert Haas wrote: > There's a very popular commercial database where, or so I have been > led to believe, any byte sequence at all is accepted when you try to > put values into the database. [...] In other circles we call this "just-use-8". ZFS, for example, has an option to require that filenames be valid UTF-8 or not, and if not it will accept any garbage (other than ASCII NUL and /, for obvious reasons). For filesystems the situation is a bit dire because: - strings at the system call boundary have never been tagged with a codeset (in the beginning there was only ASCII) - there has never been a standard codeset to use at the system call boundary, - there have been multiple codesets in use for decades so filesystems have to be prepared to be tolerant of garbage, at least until only Unicode is left (UTF-16 on Windows filesystems, UTF-8 for most others). This is another reason that ZFS has form-insensitive/form-preserving behavior: if you want to use non-UTF-8 filenames then names or substrings thereof that look like valid UTF-8 won't accidentally be broken by normalization. If PG never tagged strings with codesets on the wire then PG has the same problem, especially since there's multiple implementations of the PG wire protocol. So I can see why a "popular database" might want to take this approach. For the longer run though, either move to supporting only UTF-8, or allow multiple text types each with a codeset specified in its type. > At any rate, if we were to go in the direction of rejecting code > points that aren't yet assigned, or aren't yet known to the collation > library, that's another way for data loading to fail. Which feels like > very defensible behavior, but not what everyone wants, or is used to. Yes. See points about ZFS. I do think ZFS struck a good balance. PG could take the ZFS approach and add functions for use in CHECK constraints that enforce valid UTF-8, valid Unicode (no use of unassigned codepoints, no use of private use codepoints not configured into the database), etc. Coming back to the "just-use-8" thing, a database could have a text type where the codeset is not specified, one or more text types where the codeset is specified, manual or automatic codeset conversions, and whatever enforcement functions make sense. Provided that the type information is not lost at the edges. > > Whether we ever get to a core data type -- and more importantly, > > whether anyone uses it -- I'm not sure. > > Same here. A TEXTutf8 type (whatever name you want to give it) could be useful as a way to a) opt into heavier enforcement w/o having to write CHECK constraints, b) documentation of intent, all provided that the type is not lost on the wire nor in memory. Support for other codesets is less important. Nico --
On Tue, Oct 17, 2023 at 05:07:40PM +0200, Daniel Verite wrote: > > * Add a per-database option to enforce only storing assigned unicode > > code points. > > There's a problem in the fact that the set of assigned code points is > expanding with every Unicode release, which happens about every year. > > If we had this option in Postgres 11 released in 2018 it would use > Unicode 11, and in 2023 this feature would reject thousands of code > points that have been assigned since then. Yes, and that's desirable if PG were to normalize text as Jeff proposes, since then PG wouldn't know how to normalize text containing codepoints assigned after that. At that point to use those codepoints you'd have to upgrade PG -- not too unreasonable. Nico --
On Wed, Oct 04, 2023 at 01:15:03PM -0700, Jeff Davis wrote: > > The fact that there are multiple types of normalization and multiple > > notions of equality doesn't make this easier. And then there's text that isn't normalized to any of them. > NFC is really the only one that makes sense. Yes. Most input modes produce NFC, though there may be scripts (like Hangul) where input modes might produce NFD, so I wouldn't say NFC is universal. Unfortunately HFS+ uses NFD so NFD can leak into places naturally enough through OS X. > I believe that having a kind of text data type where it's stored in NFC > and compared with memcmp() would be a good place for many users to be - > - probably most users. It's got all the performance and stability > benefits of memcmp(), with slightly richer semantics. It's less likely > that someone malicious can confuse the database by using different > representations of the same character. > > The problem is that it's not universally better for everyone: there are > certainly users who would prefer that the codepoints they send to the > database are preserved exactly, and also users who would like to be > able to use unassigned code points. The alternative is form insensitivity, where you compare strings as equal even if they aren't memcmp()-equal, as long as they are equal when normalized. This can be made fast, though not as fast as memcmp(). The problem with form insensitivity is that you might have to implement it in numerous places. In ZFS there are only a few, but in a database every index type, for example, will need to hook in form insensitivity. If so then that complexity would be a good argument to just normalize. Nico --
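For what it's worth, form-insensitive comparison can already be approximated in Postgres with a nondeterministic ICU collation, since canonically equivalent strings compare equal under one; a sketch, with an illustrative collation name:

    CREATE COLLATION ignore_form (provider = icu, locale = 'und', deterministic = false);
    SELECT U&'\0061\0301' = U&'\00E1' COLLATE ignore_form;  -- true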
On Fri, 2023-11-03 at 10:51 +1300, Thomas Munro wrote: > bowerbird and hammerkop didn't like commit a02b37fc. They're still > using the old 3rd build system that is not tested by CI. It's due > for > removal in the 17 cycle IIUC but in the meantime I guess the new > codegen script needs to be invoked by something under src/tools/msvc? > > varlena.obj : error LNK2019: unresolved external symbol > unicode_category referenced in function unicode_assigned > [H:\\prog\\bf\\root\\HEAD\\pgsql.build\\postgres.vcxproj] I think I just need to add unicode_category.c to @pgcommonallfiles in Mkvcbuild.pm. I'll do a trial commit tomorrow and see if that fixes it unless someone has a better suggestion. Regards, Jeff Davis
On Fri, 3 Nov 2023 at 20:49, Jeff Davis <pgsql@j-davis.com> wrote: > > On Fri, 2023-11-03 at 10:51 +1300, Thomas Munro wrote: > > bowerbird and hammerkop didn't like commit a02b37fc. They're still > > using the old 3rd build system that is not tested by CI. It's due > > for > > removal in the 17 cycle IIUC but in the meantime I guess the new > > codegen script needs to be invoked by something under src/tools/msvc? > > > > varlena.obj : error LNK2019: unresolved external symbol > > unicode_category referenced in function unicode_assigned > > [H:\\prog\\bf\\root\\HEAD\\pgsql.build\\postgres.vcxproj] > > I think I just need to add unicode_category.c to @pgcommonallfiles in > Mkvcbuild.pm. I'll do a trial commit tomorrow and see if that fixes it > unless someone has a better suggestion. (I didn't realise this was being discussed.) Thomas mentioned this to me earlier today. After looking I also concluded that unicode_category.c needed to be added to @pgcommonallfiles. After looking at the time, I didn't expect you to be around so opted just to push that to fix the MSVC buildfarm members. Sorry for the duplicate effort and/or stepping on your toes. David
On Sat, Oct 28, 2023 at 4:15 AM Jeff Davis <pgsql@j-davis.com> wrote: > > I plan to commit something like v3 early next week unless someone else > has additional comments or I missed a concern. Hi Jeff, is the CF entry titled "Unicode character general category functions" ready to be marked committed?
On Fri, 2023-11-03 at 21:01 +1300, David Rowley wrote: > Thomas mentioned this to me earlier today. After looking I also > concluded that unicode_category.c needed to be added to > @pgcommonallfiles. After looking at the time, I didn't expect you to > be around so opted just to push that to fix the MSVC buildfarm > members. > > Sorry for the duplicate effort and/or stepping on your toes. Thank you, no apology necessary. Regards, Jeff Davis
On Fri, 2023-11-03 at 17:11 +0700, John Naylor wrote: > On Sat, Oct 28, 2023 at 4:15 AM Jeff Davis <pgsql@j-davis.com> wrote: > > > > I plan to commit something like v3 early next week unless someone > > else > > has additional comments or I missed a concern. > > Hi Jeff, is the CF entry titled "Unicode character general category > functions" ready to be marked committed? Done, thank you. Regards, Jeff Davis
On 2023-10-04 23:32, Chapman Flack wrote: > Well, for what reason does anybody run PG now with the encoding set > to anything besides UTF-8? I don't really have my finger on that pulse. > Could it be that it bloats common strings in their local script, and > with enough of those to store, it could matter to use the local > encoding that stores them more economically? I do use CP1251 for storing some data which comes in as XML in CP1251, and thus is guaranteed to fit. In UTF-8, that data would take exactly 2x the size on disk (before compression, and pglz/lz4 won't help much with that). -- Ph.
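A rough illustration of the 2x claim for Cyrillic text, assuming a UTF8 server encoding (WIN1251 is the Postgres name for CP1251):

    SELECT octet_length('Привет') AS utf8_bytes,                          -- 12
           octet_length(convert_to('Привет', 'WIN1251')) AS cp1251_bytes; -- 6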
On Fri, Nov 3, 2023 at 9:01 PM David Rowley <dgrowleyml@gmail.com> wrote: > On Fri, 3 Nov 2023 at 20:49, Jeff Davis <pgsql@j-davis.com> wrote: > > On Fri, 2023-11-03 at 10:51 +1300, Thomas Munro wrote: > > > bowerbird and hammerkop didn't like commit a02b37fc. They're still > > > using the old 3rd build system that is not tested by CI. It's due > > > for > > > removal in the 17 cycle IIUC but in the meantime I guess the new > > > codegen script needs to be invoked by something under src/tools/msvc? > > > > > > varlena.obj : error LNK2019: unresolved external symbol > > > unicode_category referenced in function unicode_assigned > > > [H:\\prog\\bf\\root\\HEAD\\pgsql.build\\postgres.vcxproj] > > > > I think I just need to add unicode_category.c to @pgcommonallfiles in > > Mkvcbuild.pm. I'll do a trial commit tomorrow and see if that fixes it > > unless someone has a better suggestion. > > (I didn't realise this was being discussed.) > > Thomas mentioned this to me earlier today. After looking I also > concluded that unicode_category.c needed to be added to > @pgcommonallfiles. After looking at the time, I didn't expect you to > be around so opted just to push that to fix the MSVC buildfarm > members. Shouldn't it be added unconditionally near unicode_norm.c? It looks like it was accidentally made conditional on openssl, which might explain why it worked for David but not for bowerbird.
On Sat, 4 Nov 2023 at 10:57, Thomas Munro <thomas.munro@gmail.com> wrote: > > On Fri, Nov 3, 2023 at 9:01 PM David Rowley <dgrowleyml@gmail.com> wrote: > > On Fri, 3 Nov 2023 at 20:49, Jeff Davis <pgsql@j-davis.com> wrote: > > > I think I just need to add unicode_category.c to @pgcommonallfiles in > > > Mkvcbuild.pm. I'll do a trial commit tomorrow and see if that fixes it > > > unless someone has a better suggestion. > > > > Thomas mentioned this to me earlier today. After looking I also > > concluded that unicode_category.c needed to be added to > > @pgcommonallfiles. After looking at the time, I didn't expect you to > > be around so opted just to push that to fix the MSVC buildfarm > > members. > > Shouldn't it be added unconditionally near unicode_norm.c? It looks > like it was accidentally made conditional on openssl, which might > explain why it worked for David but not for bowerbird. Well, I did that one pretty poorly :-( I've just pushed a fix for that. Thanks. David
On Mon, 2023-10-02 at 16:06 -0400, Robert Haas wrote: > It seems to me that this overlooks one of the major points of Jeff's > proposal, which is that we don't reject text input that contains > unassigned code points. That decision turns out to be really painful. Attached is an implementation of a per-database option STRICT_UNICODE which enforces the use of assigned code points only. Not everyone would want to use it. There are lots of applications that accept free-form text, and that may include recently-assigned code points not yet recognized by Postgres. But it would offer protection/stability for some databases. It makes it possible to have a hard guarantee that Unicode normalization is stable[1]. And it may also mitigate the risk of collation changes -- using unassigned code points carries a high risk that the collation order changes as soon as the collation provider recognizes the assignment. (Though assigned code points can change, too, so limiting yourself to assigned code points is only a mitigation.) I worry slightly that users will think at first that they want only assigned code points, and then later figure out that the application has increased in scope and now takes all kinds of free-form text. In that case, the user can "ALTER DATABASE ... STRICT_UNICODE FALSE", and follow up with some "CHECK (unicode_assigned(...))" constraints on the particular fields that they'd like to protect. There's some weirdness that the set of assigned code points as Postgres sees it may not match what a collation provider sees due to differing Unicode versions. That's not great -- perhaps we could check that code points are considered assigned by *both* Postgres and ICU. I don't know if there's a way to tell if libc considers a code point to be assigned. Regards, Jeff Davis [1] https://www.unicode.org/policies/stability_policy.html#Normalization
Attachment
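A hypothetical usage sketch, based only on the option name and the ALTER form quoted in the message above; the exact syntax is whatever the attached patch implements, and the table and column names are illustrative:

    -- Relax the database-wide guarantee once free-form text is needed ...
    ALTER DATABASE appdb STRICT_UNICODE FALSE;
    -- ... and keep it on the specific fields that still want it:
    ALTER TABLE notes ADD CONSTRAINT notes_body_assigned
        CHECK (unicode_assigned(body));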
On Thu, 2024-02-29 at 17:02 -0800, Jeff Davis wrote: > Attached is an implementation of a per-database option STRICT_UNICODE > which enforces the use of assigned code points only. The CF app doesn't seem to point at the latest patch: https://www.postgresql.org/message-id/a0e85aca6e03042881924c4b31a840a915a9d349.camel@j-davis.com which is perhaps why nobody has looked at it yet. But in any case, I'm OK if this gets bumped to 18. I still think it's a good feature, but some of the value will come later in v18 anyway, when I plan to propose support for case folding. Case folding is a version of lowercasing with compatibility guarantees when you only use assigned code points. Regards, Jeff Davis