Re: Rework of collation code, extensibility - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Rework of collation code, extensibility |
Date | |
Msg-id | CAH2-WzmQ+qe_bemX=S6T_xvZ-cpfDOV06imo3kpmx4m7b6b7jg@mail.gmail.com Whole thread Raw |
In response to | Re: Rework of collation code, extensibility (Jeff Davis <pgsql@j-davis.com>) |
Responses |
Re: Rework of collation code, extensibility
|
List | pgsql-hackers |
On Wed, Jan 11, 2023 at 3:44 PM Jeff Davis <pgsql@j-davis.com> wrote: > Attached trivial rebase as v6. Some review comments for this v6. Comments on 0001-*: * I think that 0002-* can be squashed with 0001-*, since there isn't any functional reason why you'd want to commit the strcoll() and strxfrm() changes separately. Sometimes it can be useful to break things up, despite the fact that it couldn't possibly make sense to commit just one of the resulting patches on its own. However, I don't think that that's appropriate here. There is no apparent conceptual boundary that you're highlighting by splitting things up like this. strxfrm() and strcoll() are defined in terms of each other -- they're siblings, joined at the hip -- so this seems kinda jarring. * Your commit message for 0001 (and other patches) don't set things up by describing what the point is, and what the work anticipates. I think that they should do that. You're adding a layer of indirection that's going to set things up for later patches that add a layer of indirection for version ICU libraries (and even libc itself), and some of the details only make sense in that context. This isn't just refactoring work that could just as easily have happened in some quite different context. * I'm not sure that pg_strcoll() should be breaking ties itself. We break ties using strcmp() for historical reasons, but must not do that for deterministic ICU collations, which may be obscured. That means that pg_strcoll()'s relationship to pg_strxfrm()'s isn't the same as the well known strcoll()/strxfrm() relationship. That kind of makes pg_strcoll() somewhat more than a strcoll() shim, which is inconsistent. Another concern is that the deterministic collation handling isn't handled in any one layer, which would have been nice. Do we need to do things this way? What's it adding? * varstrfastcmp_locale() is no longer capable of calling ucol_strcollUTF8() through the shim interface, meaning that it has to determine string length based on NUL-termination, when in principle it could just use the known length of the string. Presumably this might have performance implications. Have you thought about that? Some comments on 0002-*: * I don't see much point in this new varstr_abbrev_convert() variable: + const size_t max_prefix_bytes = sizeof(Datum); varstr_abbrev_convert() is concerned with packing abbreviated key bytes into Datums, so it's perfectly reasonable to deal with Datums/sizeof(Datum) directly. * Having a separate pg_strxfrm_prefix_libc() function just to throw an error doesn't really add much IMV. Comments on 0003-*: I suggest that some of the very trivial functions you have here (such as pg_locale_deterministic()) be made inline functions. Comments on 0006-*: * get_builtin_libc_library() could be indented in a way that would make it easier to understand. -- Peter Geoghegan
pgsql-hackers by date: