Re: Rework of collation code, extensibility - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: Rework of collation code, extensibility
Date
Msg-id 052a5ed874d110be2f3ae28752e363306b10966d.camel@j-davis.com
Whole thread Raw
In response to Re: Rework of collation code, extensibility  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: Rework of collation code, extensibility
List pgsql-hackers
On Fri, 2023-01-13 at 11:57 -0800, Peter Geoghegan wrote:
> You're adding a layer of indirection that's going to set things up
> for
> later patches that add a layer of indirection for version ICU
> libraries (and even libc itself), and some of the details only make
> sense in that context. This isn't just refactoring work that could
> just as easily have happened in some quite different context.

Right, well put. I have two goals and felt that they merged into one
patchset, but I think that caused more confusion.

The first goal I had was simply that the code was really hard to
understand and work on, and refactoring was justified to improve the
situation.

The second goal, which is somewhat dependent on the first goal, is that
we really need an ability to support multiple ICU libraries, and I
wanted to do some common groundwork that would be needed for any
approach we choose there, and provide some hooks to get us there. You
are right that this goal influenced the first goal.

I attached new patches:

  v7-0001: pg_strcoll and pg_strxfrm patches combined, your comments
addressed
  v7-0002: add pg_locale_internal.h (and other refactoring)

I will post the other patches in the other thread.

> That means that pg_strcoll()'s relationship to pg_strxfrm()'s isn't
> the same as the well known strcoll()/strxfrm() relationship.

That's a really good point. I changed tiebreaking to be the caller's
responsibility.

> * varstrfastcmp_locale() is no longer capable of calling
> ucol_strcollUTF8() through the shim interface, meaning that it has to
> determine string length based on NUL-termination, when in principle
> it
> could just use the known length of the string.

I think you misread, it still calls ucol_strcollUTF8() when applicable,
which is impoartant because otherwise it would require a conversion to
a UChar string.

ucol_strcollUTF8() accepts -1 to mean "nul-terminated". I did some
basic testing and it doesn't seem like it's slower than using the
length. If passing the length is faster for some reason, it would
complicate the API because we'd need an entry point that's expecting
nul-termination and lengths, which is awkward (as Peter E. pointed
out).

> * I don't see much point in this new varstr_abbrev_convert()
> variable:
>
> +   const size_t max_prefix_bytes = sizeof(Datum);
>
> varstr_abbrev_convert() is concerned with packing abbreviated key
> bytes into Datums, so it's perfectly reasonable to deal with
> Datums/sizeof(Datum) directly.

I felt it was a little clearer amongst the other code, to a casual
reader, but I suppose it's a style thing. I will change it if you
insist.

> * Having a separate pg_strxfrm_prefix_libc() function just to throw
> an
> error doesn't really add much IMV.

Removed.

> Comments on 0003-*:
>
> I suggest that some of the very trivial functions you have here (such
> as pg_locale_deterministic()) be made inline functions.

I'd have to expose the pg_locale_t struct, which didn't seem desirable
to me. Do you think it's enough of a performance concern to be worth
some ugliness there?


--
Jeff Davis
PostgreSQL Contributor Team - AWS



Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Extracting cross-version-upgrade knowledge from buildfarm client
Next
From: Tom Lane
Date:
Subject: Re: Extracting cross-version-upgrade knowledge from buildfarm client