Re: Collation version tracking for macOS - Mailing list pgsql-hackers
From | Jeff Davis |
---|---|
Subject | Re: Collation version tracking for macOS |
Date | |
Msg-id | c4fda90ec6a7568a896f243a38eb273c3b5c3d93.camel@j-davis.com Whole thread Raw |
In response to | Re: Collation version tracking for macOS (Thomas Munro <thomas.munro@gmail.com>) |
Responses |
Re: Collation version tracking for macOS
|
List | pgsql-hackers |
On Tue, 2022-12-06 at 10:33 +1300, Thomas Munro wrote: > OK, I'm going to see what happens if I try to wrangle that stuff into > a new catalogue table. I've been hacking on a major refactor of the locale-related code. I attached my progress and I think several patches are ready. The main motivation is that I was frustrated by the special cases everywhere. I wanted to make it easier to hack on this code going forward, now that we are adding even more complexity for multiple ICU libraries. I'm posting to this thread (rather than my previous refactoring thread[1]) because a lot of the people interested in working on this code are here. So, if you like (or don't like) the structure of these changes, please let me know. Changes: * Introduce pg_locale_internal.h to hide all USE_ICU code, including all callers of the ICU routines. The files that still need to include pg_locale_internal.h are: - pg_locale.c - regc_pg_locale.c - formatting.c - like.c - like_support.c - collationcmds.c * Other callers (in files that don't include pg_locale_internal.h) don't need to branch based on the provider, platform, database encoding, USE_ICU, HAVE_LOCALE_T, etc. * ICU and libc are treated the same way in more places. * I made it so pg_locale_t is constructed first, then moved to TopMemoryContext, so that it won't leak in TopMemoryContext if errors are encountered. * Introduce pg_strcoll, pg_strncoll, pg_strxfrm, and pg_strnxfrm so that varlena/hash/verchar code doesn't worry about the details. * Add method structure pg_icu_library, borrowed from Thomas's patch, that provides one convenient place to provide multiple-ICU-library support. * Add a hook that allows you to fill in the pg_icu_library structure however you want while a pg_locale_t is being constructed. This allows do-it-yourself ICU library lockdown. On the negative side, it increases the line count. Part of that is because adding indirection for the ICU library is just more lines of code, but a lot of it is just that I used a lot of smaller functions. Perhaps my style is a bit verbose? Even though we're close to consensus on how we should offer control over the ICU libraries, having the hook may be useful for experimentation, testing, or as a last resort. Right now the hook has limited information to use to find the right library -- just the ICU collation name and the version, because that's what we have in the catalog. But I assume the patch Thomas is working on will change that. Performance: I did brief performance sanity tests on several paths and the results are unremarkable (which is generally good for a refactor). On the path I was watching most closely, ICU/UTF8/en-US-x-icu, it came in about 2% faster, which was a pleasant surprise. This was true both when I disabled abbreviated keys (to stress localized comparison paths) and also with abbreviated keys enabled. My previous refactoring work[1] ended up a percent or two slower. My guess right now is that I moved some code around after I noticed that ICU accepts NUL-terminated strings (by specifying the lenght as -1), and that helped. But I'll need to profile and look more closely to be more certain of my results, these are preliminary. There are a few things that could be done differently: * I am still a bit confused about why someone would want a collation with a different lc_collate and lc_ctype in libc; and assuming there is a reason, why it can't be done with ICU. The way I did the refactoring tries to accommodate them as different concepts, but I can rip that out. * In theory, we could also support multilib libc, and an associated get_libc_library() and hook, but there are a couple big challenges. Firstly, if it's the default locale, it relies on setlocale(), so we'd have to figure out what to do about that. Second, having an a second version of glibc on your system is not as normal or trivial as having a second version of ICU. * I made the hook simple, but all it can do is replace the ICU library. It's possible that it would want to construct it's own entire pg_locale_t for some reason, and keep more complex state in a private pointer, or something like that. It seemed better to keep it simple, but maybe someone would want more flexibility there? * I used the library indirection for pretty much all ICU calls, including the ucnv_ and the uloc_ functions. I did this mainly because, if we are so paranoid about ICU changing in subtle ways, we might as well make it possible to lock down everything. I can rip this out, too, but it didn't add many lines. Loose ends: * I need to do something with get_collation_actual_version. I had an earlier iteration that went through pg_newlocale() and then queried the resulting pg_locale_t structure, but that changed the error paths in a way that failed a couple tests, so I left that out. * Error paths could be improved further to make sure that libc locale_t and UCollator structures are freed in error paths during construction. I was thinking about using resowner for this, and then if the pg_locale_t structure gets moved to TopMemoryContext, just doing a ResourceOwnerForget. Alternatively, I could just be careful about the error paths. * We'd need to adapt this and make sure it works with whatever scheme we decide is best for finding the right library. I suspect this would just be adding another parameter to get_icu_library (and the hook) to represent the new collation provider Oid (and get_icu_library could use that to look up the library names and load them). Comments welcome. [1] https://www.postgresql.org/message-id/99aa79cceefd1fe84fda23510494b8fbb7ad1e70.camel@j-davis.com -- Jeff Davis PostgreSQL Contributor Team - AWS
Attachment
pgsql-hackers by date: