Thread: Rework of collation code, extensibility
Attached is a new patch series. I think there are enough changes that this has become more of a "rework" of the collation code rather than just a refactoring. This is a continuation of some prior work[1][2] in a new thread given its new scope. Benefits: 1. Clearer division of responsibilities. 2. More consistent between libc and ICU providers. 3. Hooks that allow extensions to replace collation provider libraries. 4. New tests for the collation provider library hooks. There are a lot of changes, and still some loose ends, but I believe a few of these patches are close to ready. This set of changes does not express an opinion on how we might want to support multiple provider libraries in core; but whatever we choose, it should be easier to accomplish. Right now, the hooks have limited information on which to make the choice for a specific version of a collation provider library, but that's because there's limited information in the catalog. If the discussion here[3] concludes in adding collation provider library or library version information to the catalog, we can add additional parameters to the hooks. [1] https://postgr.es/m/99aa79cceefd1fe84fda23510494b8fbb7ad1e70.camel@j-davis.com [2] https://postgr.es/m/c4fda90ec6a7568a896f243a38eb273c3b5c3d93.camel@j-davis.com [3] https://postgr.es/m/CA+hUKGLEqMhnpZrgAcisoUeYFGz8W6EWdhtK2h-4QN0iOSFRqw@mail.gmail.com -- Jeff Davis PostgreSQL Contributor Team - AWS
Attachment
- v4-0001-Add-pg_strcoll-and-pg_strncoll.patch
- v4-0002-Add-pg_strxfrm-and-pg_strxfrm_prefix.patch
- v4-0003-Refactor-pg_locale_t-routines.patch
- v4-0004-Support-multiple-ICU-collation-provider-libraries.patch
- v4-0005-Support-multiple-libc-collation-provider-librarie.patch
- v4-0006-Add-tests-for-collation-provider-hooks.patch
On Sat, Dec 17, 2022 at 7:14 PM Jeff Davis <pgsql@j-davis.com> wrote:
Attached is a new patch series. I think there are enough changes that
this has become more of a "rework" of the collation code rather than
just a refactoring. This is a continuation of some prior work[1][2] in
a new thread given its new scope.
Benefits:
1. Clearer division of responsibilities.
2. More consistent between libc and ICU providers.
3. Hooks that allow extensions to replace collation provider libraries.
4. New tests for the collation provider library hooks.
There are a lot of changes, and still some loose ends, but I believe a
few of these patches are close to ready.
This set of changes does not express an opinion on how we might want to
support multiple provider libraries in core; but whatever we choose, it
should be easier to accomplish. Right now, the hooks have limited
information on which to make the choice for a specific version of a
collation provider library, but that's because there's limited
information in the catalog. If the discussion here[3] concludes in
adding collation provider library or library version information to the
catalog, we can add additional parameters to the hooks.
[1]
https://postgr.es/m/99aa79cceefd1fe84fda23510494b8fbb7ad1e70.camel@j-davis.com
[2]
https://postgr.es/m/c4fda90ec6a7568a896f243a38eb273c3b5c3d93.camel@j-davis.com
[3]
https://postgr.es/m/CA+hUKGLEqMhnpZrgAcisoUeYFGz8W6EWdhtK2h-4QN0iOSFRqw@mail.gmail.com
--
Jeff Davis
PostgreSQL Contributor Team - AWS
Hi,
For pg_strxfrm_libc in v4-0002-Add-pg_strxfrm-and-pg_strxfrm_prefix.patch:
+ if (locale)
+ return strxfrm_l(dest, src, destsize, locale->info.lt);
+ else
+#endif
+ return strxfrm(dest, src, destsize);
It seems the `else` is not needed (since when the if branch is taken, we return from the func).
+ /* nul-terminate arguments */
nul-terminate -> null-terminate
For pg_strnxfrm(), I think `result` can be removed - we directly return the result from pg_strnxfrm_libc or pg_strnxfrm_icu.
Cheers
On Sun, Dec 18, 2022 at 10:28 AM Ted Yu <yuzhihong@gmail.com> wrote:
> It seems the `else` is not needed (since when the if branch is taken, we return from the func).
By that same logic, this review comment is not needed, since compiler vendors don't charge license fees by the number of keywords. ;-)
Joking aside, we don't really have a project style preference for this case.
> nul-terminate -> null-terminate
NUL is a common abbreviation for the zero byte (but not for zero pointers). See the ascii manpage.
--
John Naylor
EDB: http://www.enterprisedb.com
NUL is a common abbreviation for the zero byte (but not for zero pointers). See the ascii manpage.
--
John Naylor
EDB: http://www.enterprisedb.com
On Sat, Dec 17, 2022 at 8:54 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> nul-terminate -> null-terminate
NUL is a common abbreviation for the zero byte (but not for zero pointers). See the ascii manpage.
--
John Naylor
EDB: http://www.enterprisedb.com
Ah.
`nul-terminated` does appear in the codebase.
Should have checked earlier.
On Sat, 2022-12-17 at 19:14 -0800, Jeff Davis wrote: > Attached is a new patch series. I think there are enough changes that > this has become more of a "rework" of the collation code rather than > just a refactoring. This is a continuation of some prior work[1][2] > in > a new thread given its new scope. Here's version 5. There are a number of fixes, and better tests, and it's passing in CI. The libc hook support is still experimental, but what's working is passing in CI, even on windows. The challenges with libc hook support are: * It obviously doesn't replace all of libc, so the separation is not as clean and there are a number of callers throughout the code that don't necessarily care about specific collations. * libc relies on setlocale() / uselocale(), which is global state and not as easy to track. * More platform issues (obviously) and harder to test. -- Jeff Davis PostgreSQL Contributor Team - AWS
Attachment
- v5-0007-Add-test-module-for-libc-collation-provider-hook.patch
- v5-0006-Support-multiple-libc-collation-provider-librarie.patch
- v5-0005-Add-test-module-for-icu-collation-provider-hook.patch
- v5-0004-Support-multiple-ICU-collation-provider-libraries.patch
- v5-0003-Refactor-pg_locale_t-routines.patch
- v5-0002-Add-pg_strxfrm-and-pg_strxfrm_prefix.patch
- v5-0001-Add-pg_strcoll-and-pg_strncoll.patch
On 22.12.22 06:40, Jeff Davis wrote: > On Sat, 2022-12-17 at 19:14 -0800, Jeff Davis wrote: >> Attached is a new patch series. I think there are enough changes that >> this has become more of a "rework" of the collation code rather than >> just a refactoring. This is a continuation of some prior work[1][2] >> in >> a new thread given its new scope. > > Here's version 5. There are a number of fixes, and better tests, and > it's passing in CI. > > The libc hook support is still experimental, but what's working is > passing in CI, even on windows. The challenges with libc hook support > are: > > * It obviously doesn't replace all of libc, so the separation is not > as clean and there are a number of callers throughout the code that > don't necessarily care about specific collations. > > * libc relies on setlocale() / uselocale(), which is global state and > not as easy to track. > > * More platform issues (obviously) and harder to test. I'm confused by this patch set. It combines some refactoring that was previously posted with partial support for multiple ICU libraries with partial support for some new hooks. Shouldn't those be three separate threads? I think the multiple ICU libraries already does have a separate thread; how does this relate to that work? I don't know what the hooks are supposed to be for? What other locale libraries are you thinking about using this way? How can we asses whether these interfaces are sufficient for that? The refactoring patches don't look convincing just by looking at the numbers: 3 files changed, 406 insertions(+), 247 deletions(-) 6 files changed, 481 insertions(+), 150 deletions(-) 12 files changed, 400 insertions(+), 323 deletions(-) My sense is this is trying to do too many things at once, and those things are each not fully developed yet.
On Wed, 2023-01-04 at 22:46 +0100, Peter Eisentraut wrote: > It combines some refactoring that was previously posted with partial > support for multiple ICU libraries with partial support for some new > hooks. Shouldn't those be three separate threads? Originally they felt more separate to me, too; but as I worked on them it seemed better to consider them as a patch series. Whatever is easier for reviewers works for me, though. > I think the multiple > ICU libraries already does have a separate thread; how does this > relate > to that work? Multilib ICU support adds complexity, and my hope is that this patch set cleans up and organizes things to better prepare for that complexity. > I don't know what the hooks are supposed to be for? I found them very useful for testing during development. One of the patches adds a test module for the ICU hook, and I think that's a valuable place to test regardless of whether any other extension uses the hook. Also, if proper multilib support doesn't land in 16, then the hooks could be a way to build rudimentary multilib support (or at least some kind of ICU version lockdown) until it does land. When Thomas's work is in place, I expect the hooks to change slightly. The hooks are not meant to set any specific API in stone. > What > other locale libraries are you thinking about using this way? How > can > we asses whether these interfaces are sufficient for that? I'm not considering any other locale libraries, nor did I see much discussion of that. > The > refactoring patches don't look convincing just by looking at the > numbers: > > 3 files changed, 406 insertions(+), 247 deletions(-) > 6 files changed, 481 insertions(+), 150 deletions(-) > 12 files changed, 400 insertions(+), 323 deletions(-) The existing code is not great, in my opinion: it doesn't have clear API boundaries, the comments are insufficient, and lots of special cases need to be handled awkwardly by callers. That style is hard to beat when it comes to the raw line count; but it's quite difficult to understand and work on. I think my changes are an improvement, but obviously that depends on the opinion of others who are working in this part of the code. What do you think? -- Jeff Davis PostgreSQL Contributor Team - AWS
On 06.01.23 08:04, Jeff Davis wrote: > The existing code is not great, in my opinion: it doesn't have clear > API boundaries, the comments are insufficient, and lots of special > cases need to be handled awkwardly by callers. That style is hard to > beat when it comes to the raw line count; but it's quite difficult to > understand and work on. > > I think my changes are an improvement, but obviously that depends on > the opinion of others who are working in this part of the code. What do > you think? I think the refactoring that you proposed in the thread "Refactor to introduce pg_strcoll()." was on a sensible track. Maybe we should try to get that done. The multiple-ICU stuff is still experimental and has its own rather impressive thread, so I don't think it's sensible to try to sort that out here.
On Thu, 22 Dec 2022 at 11:11, Jeff Davis <pgsql@j-davis.com> wrote: > > On Sat, 2022-12-17 at 19:14 -0800, Jeff Davis wrote: > > Attached is a new patch series. I think there are enough changes that > > this has become more of a "rework" of the collation code rather than > > just a refactoring. This is a continuation of some prior work[1][2] > > in > > a new thread given its new scope. > > Here's version 5. There are a number of fixes, and better tests, and > it's passing in CI. > > The libc hook support is still experimental, but what's working is > passing in CI, even on windows. The challenges with libc hook support > are: > > * It obviously doesn't replace all of libc, so the separation is not > as clean and there are a number of callers throughout the code that > don't necessarily care about specific collations. > > * libc relies on setlocale() / uselocale(), which is global state and > not as easy to track. > > * More platform issues (obviously) and harder to test. The patch does not apply on top of HEAD as in [1], please post a rebased patch: === Applying patches on top of PostgreSQL commit ID c971a5b27ac946e7c94f7f655d321279512c7ee7 === === applying patch ./v5-0003-Refactor-pg_locale_t-routines.patch .... Hunk #1 FAILED at 88. ... 1 out of 9 hunks FAILED -- saving rejects to file src/backend/utils/adt/formatting.c.rej patching file src/backend/utils/adt/like.c Hunk #1 FAILED at 24. Hunk #2 succeeded at 97 (offset 1 line). 1 out of 2 hunks FAILED -- saving rejects to file src/backend/utils/adt/like.c.rej [1] - http://cfbot.cputube.org/patch_41_4058.log Regards, Vignesh
On Wed, 2022-12-21 at 21:40 -0800, Jeff Davis wrote: > Here's version 5. There are a number of fixes, and better tests, and > it's passing in CI. Attached trivial rebase as v6. > The libc hook support is still experimental Patches 0006 and 0007 should still be considered experimental and don't require review right now. -- Jeff Davis PostgreSQL Contributor Team - AWS
Attachment
- v6-0007-Add-test-module-for-libc-collation-provider-hook.patch
- v6-0006-Support-multiple-libc-collation-provider-librarie.patch
- v6-0005-Add-test-module-for-icu-collation-provider-hook.patch
- v6-0004-Support-multiple-ICU-collation-provider-libraries.patch
- v6-0003-Refactor-pg_locale_t-routines.patch
- v6-0002-Add-pg_strxfrm-and-pg_strxfrm_prefix.patch
- v6-0001-Add-pg_strcoll-and-pg_strncoll.patch
On Wed, 2023-01-11 at 15:08 +0100, Peter Eisentraut wrote: > I think the refactoring that you proposed in the thread "Refactor to > introduce pg_strcoll()." was on a sensible track. Maybe we should > try > to get that done. Those should be patches 0001-0003 in this thread (now at v6), which are all pure refactoring. Let's consider those patches the topic of this thread and I'll move 0004-0007 back to the multi-lib ICU thread on the next revision. -- Jeff Davis PostgreSQL Contributor Team - AWS
On Wed, Jan 11, 2023 at 3:44 PM Jeff Davis <pgsql@j-davis.com> wrote: > Attached trivial rebase as v6. Some review comments for this v6. Comments on 0001-*: * I think that 0002-* can be squashed with 0001-*, since there isn't any functional reason why you'd want to commit the strcoll() and strxfrm() changes separately. Sometimes it can be useful to break things up, despite the fact that it couldn't possibly make sense to commit just one of the resulting patches on its own. However, I don't think that that's appropriate here. There is no apparent conceptual boundary that you're highlighting by splitting things up like this. strxfrm() and strcoll() are defined in terms of each other -- they're siblings, joined at the hip -- so this seems kinda jarring. * Your commit message for 0001 (and other patches) don't set things up by describing what the point is, and what the work anticipates. I think that they should do that. You're adding a layer of indirection that's going to set things up for later patches that add a layer of indirection for version ICU libraries (and even libc itself), and some of the details only make sense in that context. This isn't just refactoring work that could just as easily have happened in some quite different context. * I'm not sure that pg_strcoll() should be breaking ties itself. We break ties using strcmp() for historical reasons, but must not do that for deterministic ICU collations, which may be obscured. That means that pg_strcoll()'s relationship to pg_strxfrm()'s isn't the same as the well known strcoll()/strxfrm() relationship. That kind of makes pg_strcoll() somewhat more than a strcoll() shim, which is inconsistent. Another concern is that the deterministic collation handling isn't handled in any one layer, which would have been nice. Do we need to do things this way? What's it adding? * varstrfastcmp_locale() is no longer capable of calling ucol_strcollUTF8() through the shim interface, meaning that it has to determine string length based on NUL-termination, when in principle it could just use the known length of the string. Presumably this might have performance implications. Have you thought about that? Some comments on 0002-*: * I don't see much point in this new varstr_abbrev_convert() variable: + const size_t max_prefix_bytes = sizeof(Datum); varstr_abbrev_convert() is concerned with packing abbreviated key bytes into Datums, so it's perfectly reasonable to deal with Datums/sizeof(Datum) directly. * Having a separate pg_strxfrm_prefix_libc() function just to throw an error doesn't really add much IMV. Comments on 0003-*: I suggest that some of the very trivial functions you have here (such as pg_locale_deterministic()) be made inline functions. Comments on 0006-*: * get_builtin_libc_library() could be indented in a way that would make it easier to understand. -- Peter Geoghegan
On Fri, 2023-01-13 at 11:57 -0800, Peter Geoghegan wrote: > You're adding a layer of indirection that's going to set things up > for > later patches that add a layer of indirection for version ICU > libraries (and even libc itself), and some of the details only make > sense in that context. This isn't just refactoring work that could > just as easily have happened in some quite different context. Right, well put. I have two goals and felt that they merged into one patchset, but I think that caused more confusion. The first goal I had was simply that the code was really hard to understand and work on, and refactoring was justified to improve the situation. The second goal, which is somewhat dependent on the first goal, is that we really need an ability to support multiple ICU libraries, and I wanted to do some common groundwork that would be needed for any approach we choose there, and provide some hooks to get us there. You are right that this goal influenced the first goal. I attached new patches: v7-0001: pg_strcoll and pg_strxfrm patches combined, your comments addressed v7-0002: add pg_locale_internal.h (and other refactoring) I will post the other patches in the other thread. > That means that pg_strcoll()'s relationship to pg_strxfrm()'s isn't > the same as the well known strcoll()/strxfrm() relationship. That's a really good point. I changed tiebreaking to be the caller's responsibility. > * varstrfastcmp_locale() is no longer capable of calling > ucol_strcollUTF8() through the shim interface, meaning that it has to > determine string length based on NUL-termination, when in principle > it > could just use the known length of the string. I think you misread, it still calls ucol_strcollUTF8() when applicable, which is impoartant because otherwise it would require a conversion to a UChar string. ucol_strcollUTF8() accepts -1 to mean "nul-terminated". I did some basic testing and it doesn't seem like it's slower than using the length. If passing the length is faster for some reason, it would complicate the API because we'd need an entry point that's expecting nul-termination and lengths, which is awkward (as Peter E. pointed out). > * I don't see much point in this new varstr_abbrev_convert() > variable: > > + const size_t max_prefix_bytes = sizeof(Datum); > > varstr_abbrev_convert() is concerned with packing abbreviated key > bytes into Datums, so it's perfectly reasonable to deal with > Datums/sizeof(Datum) directly. I felt it was a little clearer amongst the other code, to a casual reader, but I suppose it's a style thing. I will change it if you insist. > * Having a separate pg_strxfrm_prefix_libc() function just to throw > an > error doesn't really add much IMV. Removed. > Comments on 0003-*: > > I suggest that some of the very trivial functions you have here (such > as pg_locale_deterministic()) be made inline functions. I'd have to expose the pg_locale_t struct, which didn't seem desirable to me. Do you think it's enough of a performance concern to be worth some ugliness there? -- Jeff Davis PostgreSQL Contributor Team - AWS
Attachment
On Sat, Jan 14, 2023 at 12:03 PM Jeff Davis <pgsql@j-davis.com> wrote: > The first goal I had was simply that the code was really hard to > understand and work on, and refactoring was justified to improve the > situation. > > The second goal, which is somewhat dependent on the first goal, is that > we really need an ability to support multiple ICU libraries, and I > wanted to do some common groundwork that would be needed for any > approach we choose there, and provide some hooks to get us there. You > are right that this goal influenced the first goal. I don't disagree that it was somewhat independent of the first goal. I just think that it makes sense to "round up to fully dependent". Basically it's not independent enough to be worth talking about as an independent thing, just as a practical matter - it's confusing at the level of things like the commit message. There is a clear direction that you're going in here from the start, and your intentions in 0001 do matter to somebody that's just looking at 0001 in isolation. That is my opinion, at least. The second goal is a perfectly good enough goal on its own, and one that I am totally supportive of. Making the code clearer is icing on the cake. > ucol_strcollUTF8() accepts -1 to mean "nul-terminated". I did some > basic testing and it doesn't seem like it's slower than using the > length. If passing the length is faster for some reason, it would > complicate the API because we'd need an entry point that's expecting > nul-termination and lengths, which is awkward (as Peter E. pointed > out). That's good. I'm happy to leave it at that. I was only enquiring. > I felt it was a little clearer amongst the other code, to a casual > reader, but I suppose it's a style thing. I will change it if you > insist. I certainly won't insist. > I'd have to expose the pg_locale_t struct, which didn't seem desirable > to me. Do you think it's enough of a performance concern to be worth > some ugliness there? I don't know. Quite possibly not. It would be nice to have some data on that, though. -- Peter Geoghegan
On Tue, 2023-01-17 at 14:18 -0800, Peter Geoghegan wrote: > The second goal is a perfectly good enough goal on its own, and one > that I am totally supportive of. Making the code clearer is icing on > the cake. Attached v8, which is just a rebase. To reiterate: commitfest entry https://commitfest.postgresql.org/41/3956/ is dependent on these patches and is a big part of the motivation for refactoring. > > I don't know. Quite possibly not. It would be nice to have some data > on that, though. I tested with hash aggregation, which might be more dependent on pg_locale_deterministic() than sorting. I didn't see any significant difference between master and the refactoring branch, so I don't see a need to make that function "inline". I also re-tested sorting and found some interesting results for en-US- x-icu on a UTF-8 database (which is I suspect one of the most common configurations for ICU): * the refactoring branch is now more than 5% faster, whether using abbreviated keys or not * disabling abbreviated keys makes sorting 8-10% faster on both master and the refactoring branch Both of these are surprising, and I haven't investigated deeply yet. Maybe something about LTO, some intervening patch, or I just made some mistakes somewhere (I did this fairly quickly). But as of now, it doesn't look like the refactoring patch hurts anything. -- Jeff Davis PostgreSQL Contributor Team - AWS
Attachment
On Fri, 2023-01-20 at 12:54 -0800, Jeff Davis wrote: > Both of these are surprising, and I haven't investigated deeply yet. It's just because autoconf defaults to -O2 and meson to -O3, at least on my machine. It turns out that, at -O2, master and the refactoring branch are even; but at -O3, both get faster, and the refactoring pulls ahead by a few percentage points. At least that's what's happening for en-US-x-icu on UTF-8 with my test data set. I didn't see much of a difference in other situations, but I didn't retest those other situations this time around. We should still look into why disabling abbreviated keys improves performance in some cases. Maybe we need a GUC for that? -- Jeff Davis PostgreSQL Contributor Team - AWS
Attached v9 and added perf numbers below. I'm hoping to commit 0002 and 0003 soon-ish, maybe a week or two, please let me know if you want me to hold off. (I won't commit the GUCs unless others find them generally useful; they are included here to more easily reproduce my performance tests.) My primary motivation is still related to https://commitfest.postgresql.org/41/3956/ but the combination of cleaner code and a performance boost seems like reasonable justification for this patch set independently. There aren't any clear open items on this patch. Peter Eisentraut asked me to focus this thread on the refactoring, which I've done by reducing it to 2 patches, and I left multilib ICU up to the other thread. He also questioned the increased line count, but I think the currently-low line count is due to bad style. PeterG provided some review comments, in particular when to do the tiebreaking, which I addressed. This patch has been around for a while, so I'll take a fresh look and see if I see risk areas, and re-run a few sanity checks. Of course more feedback would also be welcome. PERFORMANCE: ====== Setup: ====== base: master with v9-0001 applied (GUCs only) refactor: master with v9-0001, v9-0002, v9-0003 applied Note that I wasn't able to see any performance difference between the base and master, v9-0001 just adds some GUCs to make testing easier. glibc 2.35 ICU 70.1 gcc 11.3.0 LLVM 14.0.0 built with meson (uses -O3) $ perl text_generator.pl 10000000 10 > /tmp/strings.utf8.txt CREATE TABLE s (t TEXT); COPY s FROM '/tmp/strings.utf8.txt'; VACUUM FREEZE s; CHECKPOINT; SET work_mem='10GB'; SET max_parallel_workers = 0; SET max_parallel_workers_per_gather = 0; ============= Test queries: ============= EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "C"; EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "en_US"; EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "en-US-x-icu"; Timings are measured as the milliseconds to return the first tuple from the Sort operator (as reported in EXPLAIN ANALYZE). Median of three runs. ======== Results: ======== base refactor speedup sort_abbreviated_keys=false: C 7377 7273 1.4% en_US 35081 35090 0.0% en-US-x-ixu 20520 19465 5.4% sort_abbreviated_keys=true: C 8105 8008 1.2% en_US 35067 34850 0.6% en-US-x-icu 22626 21507 5.2% =========== Conclusion: =========== These numbers can move +/-1 percentage point, so I'd interpret anything less than that as noise. This happens to be the first run where all the numbers favored the refactoring patch, but it is generally consistent with what I had seen before. The important part is that, for ICU, it appears to be a substantial speedup when using meson (-O3). Also, when/if the multilib ICU support goes in, that may lose some of these gains due to an extra indirect function call. -- Jeff Davis PostgreSQL Contributor Team - AWS
Attachment
On 27.01.23 00:47, Jeff Davis wrote: > I'm hoping to commit 0002 and 0003 soon-ish, maybe a week or two, > please let me know if you want me to hold off. (I won't commit the GUCs > unless others find them generally useful; they are included here to > more easily reproduce my performance tests.) I have looked a bit at 0002 and 0003. I like the direction. I'll spend a bit more time reviewing it in detail. It moves a lot of code around. I don't know to what extent this depends on the abbreviated key GUC discussion. Does the rest of this patch set depend on this?
On Tue, 2023-01-31 at 11:40 +0100, Peter Eisentraut wrote: > I don't know to what extent this depends on the abbreviated key GUC > discussion. Does the rest of this patch set depend on this? The overall refactoring is not dependent logically on the GUC patch. It may require some trivial fixup if you eliminate the GUC patch. I left it there because it makes exploring/testing easier (at least for me), but the GUC patch doesn't need to be committed if there's no consensus. Regards, Jeff Davis
On 01.02.23 00:33, Jeff Davis wrote: > On Tue, 2023-01-31 at 11:40 +0100, Peter Eisentraut wrote: >> I don't know to what extent this depends on the abbreviated key GUC >> discussion. Does the rest of this patch set depend on this? > > The overall refactoring is not dependent logically on the GUC patch. It > may require some trivial fixup if you eliminate the GUC patch. > > I left it there because it makes exploring/testing easier (at least for > me), but the GUC patch doesn't need to be committed if there's no > consensus. I took another closer look at the 0002 and 0003 patches. The commit message for 0002 says "Also remove the TRUST_STRXFRM define", but I think this is incorrect, as that is done in the 0001 patch. I don't like that the pg_strnxfrm() function requires these kinds of repetitive error checks: + if (rsize != bsize) + elog(ERROR, "pg_strnxfrm() returned unexpected result"); This could be checked inside the function itself, so that the callers don't have to do this themselves every time. I don't really understand the 0003 patch. It's a lot of churn but I'm not sure that it achieves more clarity or something. The new function pg_locale_deterministic() seems sensible. Maybe this could be proposed as a separate patch. I don't understand the new header pg_locale_internal.h. What is "internal" and what is not internal? What are we hiding from whom? There are no code comments about this AFAICT. pg_locale_struct has new fields + char *collate; + char *ctype; that are not explained anywhere. I think this patch would need a bit more explanation and commenting.
New version attached. Changes: * I moved the GUC patch to the end (so you can ignore it if it's not useful for review) * I cut out the pg_locale_internal.h rearrangement (at least for now, it might seem useful after the dust settles on the other changes). * I added a separate patch for pg_locale_deterministic(). * I added a separate patch for a simple cleanup of a USE_ICU special case. Now the patches are: 0001: pg_strcoll/pg_strxfrm 0002: pg_locale_deterministic() 0003: cleanup a USE_ICU special case 0004: GUCs (only for testing, not for commit) Responses to your review comments inline below: On Mon, 2023-02-13 at 11:35 +0100, Peter Eisentraut wrote: > The commit message for 0002 says "Also remove the TRUST_STRXFRM > define", > but I think this is incorrect, as that is done in the 0001 patch. Fixed. > I don't like that the pg_strnxfrm() function requires these kinds of > repetitive error checks: > > + if (rsize != bsize) > + elog(ERROR, "pg_strnxfrm() returned unexpected > result"); > > This could be checked inside the function itself, so that the callers > don't have to do this themselves every time. The current API allows for a pattern like: /* avoids extra work if existing buffer is big enough */ len = pg_strxfrm(buf, src, bufSize, loc); if (len >= bufSize) { buf = repalloc(len+1); bufSize = len+1; len2 = pg_strxfrm(buf, src, bufSize, loc); } The test for rsize != bsize are just there to check that the underlying library calls (strxfrm or getSortKey) behave as documented, and we expect that they'd never be hit. It's hard to move that kind of check into pg_strxfrm() without making it also manage the buffers. Do you have a more specific suggestion? I'd like to keep the API flexible enough that the caller can manage the buffers, like with abbreviated keys. Perhaps the check can just be removed if we trust that the library functions at least get the size calculation right? Or turned into an Assert? -- Jeff Davis PostgreSQL Contributor Team - AWS
Attachment
On Mon, 2023-02-13 at 15:45 -0800, Jeff Davis wrote: > New version attached. Changes: These patches, especially 0001, have been around for a while, and they've received some review attention with no outstanding TODOs that I'm aware of. I plan to commit v10 (or something close to it) soon unless someone has additional feedback. -- Jeff Davis PostgreSQL Contributor Team - AWS
On 14.02.23 00:45, Jeff Davis wrote: > Now the patches are: > > 0001: pg_strcoll/pg_strxfrm > 0002: pg_locale_deterministic() > 0003: cleanup a USE_ICU special case > 0004: GUCs (only for testing, not for commit) I haven't read the whole thing again, but this arrangement looks good to me. I don't have an opinion on whether 0004 is actually useful.
On Wed, 2023-02-22 at 20:49 +0100, Peter Eisentraut wrote: > > On 14.02.23 00:45, Jeff Davis wrote: > > > > Now the patches are: > > > > > > > > 0001: pg_strcoll/pg_strxfrm > > > > 0002: pg_locale_deterministic() > > > > 0003: cleanup a USE_ICU special case > > > > 0004: GUCs (only for testing, not for commit) > > > > I haven't read the whole thing again, but this arrangement looks > > good > to > > me. I don't have an opinion on whether 0004 is actually useful. Committed with a few revisions after I took a fresh look over the patch. The most significant was that I found out that we are also hashing the NUL byte at the end of the string when the collation is non- deterministic. The refactoring patch doesn't change that of course, but the API from pg_strnxfrm() is more clear and I added comments. Also, ICU uses int32_t for string lengths rather than size_t (I'm not sure that's a great idea, but that's what ICU does). I clarified the boundary by changing the argument types of the ICU-specific static functions to int32_t, while leaving the API entry points as size_t. -- Jeff Davis PostgreSQL Contributor Team - AWS