Home > mailing lists

Re: Built-in CTYPE provider - Mailing list pgsql-hackers

From	Jeff Davis
Subject	Re: Built-in CTYPE provider
Date	March 13, 2024 07:44:37
Msg-id	846a7e1fa2024918b58ff7583523a2f3de9a11b4.camel@j-davis.com Whole thread Raw
In response to	Re: Built-in CTYPE provider (Peter Eisentraut <peter@eisentraut.org>)
Responses	Re: Built-in CTYPE provider
List	pgsql-hackers

Tree view

On Tue, 2024-03-12 at 09:24 +0100, Peter Eisentraut wrote:
> In the documentation, let's make the list of locale providers an
> actual
> list instead of a sequence of <sect3>s.

Done.

> We had some discussion on initdb option --builtin-locale and whether
> it
> should be something more general.  I'm ok with leaving it like this
> for
> now and maybe consider as an "open item" for PG17.

OK.

> In
>
>      errmsg("parameter \"locale\" must be specified")
>
> make "locale" a placeholder.  (See commit 36a14afc076).

Done.

> It seems the builtin provider accepts both "C" and "POSIX" as locale
> names, but the documentation says it must be "C".  Maybe we don't
> need
> to accept "POSIX"?  (Seeing that there are no plans for "POSIX.UTF-
> 8",
> maybe we just ignore the "POSIX" spelling altogether?)

Agreed, removed "POSIX".

> Speaking of which, the code in postinit.c is inconsistent in that
> respect with builtin_validate_locale().  Shouldn't postinit.c use
> builtin_validate_locale(), to keep it consistent?

Agreed, done.

> Or, there could be a general function that accepts a locale provider
> and
> a locale string and validates everything together?

That's a good idea -- perhaps a separate cleanup patch?

> In initdb.c, this message
>
> printf(_("The database cluster will be initialized with no
> locale.\n"));
>
> sounds a bit confusing.  I think it's ok to show "C" as a locale. 
> I'm
> not sure we need to change the logic here.

Agreed, removed.

> Also in initdb.c, this message
>
> pg_fatal("locale must be specified unless provider is libc");
>
> should be flipped around, like
>
> locale must be specified if provider is %s

Done.

> In pg_dump.c, dumpDatabase(), there are some new warning messages
> that
> are not specifically about the builtin provider.  Are those existing
> deficiencies?  It's not clear to me.

I wouldn't call that a deficiency, but it seemed to be a convenient
place to do some extra sanity checking along with the minor
reorganization I did in that area.

> What are the changes in the pg_upgrade test about?  Maybe explain the
> scenario it is trying to test briefly?

It's trying to be a better test for commit 9637badd9f, which eliminates
needless locale incompatibilities when performing a pg_upgrade.

At the time of that commit, the options for testing were fairly
limited, so I'm just expanding on that here a bit. It might be slightly
over-engineered? I added some comments and cleaned it up.

> > 0004: Inline some UTF-8 functions to improve performance
>
> Makes sense that inlining can be effective here.  But why aren't you
> just inlining the existing function pg_utf_mblen()?  Now we have two
> functions that do the same thing.  And the comment at pg_utf_mblen()
> is
> removed completely, so it's not clear anymore why it exists.

I was trying to figure out what to do about USE_PRIVATE_ENCODING_FUNCS.

If libpq exports pg_utf_mblen(), it needs to continue to export that,
or else it's an ABI break, right? So that means we need at least one
extern copy of the function. See b6c7cfac88.

Though now that I look at it, I'm not even calling the inlined version
from my code -- I must have been using it in an earlier version and now
not. So I just left pg_utf_mblen() alone, and inlined unicode_to_utf8()
and utf8_to_unicode().

> > 0005: Add a unicode_strtitle() function and move the implementation
> > for
> > the builtin provider out of formatting.c.
>
> In the recent discussion you had expression some uncertainty about
> the
> detailed semantics of this.  INITCAP() was copied from Oracle, so we
> could check there for reference, too.  Or we go with full Unicode
> semantics.  I'm not clear on all the differences and tradeoffs, if
> there
> are any.  In any case, it would be good if there were documentation
> or a
> comment that somehow wrote down the resolution of this.

There are a few nuances that are different between the Unicode way to
titlecase a string and INITCAP():

  1. For the initial character in a word, Unicode uses the titlecase
mapping, whereas INITCAP (as the name suggests) uses the uppercase
mapping.
  2. Unicode uses full case mapping, which can change the length of the
string (e.g. mapping "ß" to the titlecase "Ss" -- though I've heard
that titlecasing "ß" doesn't make a lot of sense in German because
words typically don't begin with it). Full case mapping can also handle
context-sensitive mappings, such as the "final sigma".
  3. Unicode has a lot to say about word boundaries, whereas INITCAP()
just uses the boundary between alnum and !alnum.

The unicode_strtitle() function is just a way to unify those
differences into one implementation. A "full" parameter controls
behaviors 1 & 2, and a callback handles 3. If we just want to keep it
simple, we can leave it as the character-by-character algorithm in
formatting.c.

My uncertainty was whether we really want INITCAP to be doing these
more sophisticated titlecasing transformations, or whether that should
be a separate sql function (title()? titlecase()?), or whether we just
don't need that functionality.

New series attached. I plan to commit 0001 very soon.

Regards,
    Jeff Davis

Attachment

pgsql-hackers by date:

From: "Anton A. Melnikov"
Date: 13 March 2024, 07:41:45
Subject: Re: Add the ability to limit the amount of memory that can be allocated to backends.

From: "Hayato Kuroda (Fujitsu)"
Date: 13 March 2024, 07:47:52
Subject: RE: speed up a logical replica setup

Re: Built-in CTYPE provider - Mailing list pgsql-hackers

Attachment

Previous

Next